Removing tumour purity, library size and batch effects from the TCGA breast cancer RNA-seq data using RUV-III-PRPS
Last updated on: 2022-04-29
1 Introduction
Effective removal of unwanted variation is essential to derive meaningful biological results from RNA-seq data, particularly when the data comes from large and complex studies. We have previously proposed a new method, removing unwanted variation III (RUV-III) to normalize gene expression data (R.Molania, NAR, 2019). The RUV-III method requires well-designed technical replicates (well-distributed across sources of unwanted variation) and negative control genes to estimate known and unknown sources of unwanted variation and remove it from the data.
We propose a novel strategy, pseudo-replicates of pseudo-samples (PRPS) R.Molania, bioRxiv, 2021, for deploying RUV-III to normalize RNA-seq data in situations when technical replicates are not available or are not well-designed. Our approach requires at least one roughly known biologically homogenous subclass of samples presented across sources of unwanted variation. For example, in a cancer RNA-seq study where there are normal tissues present across all sources of unwanted variation. Then, we can use these samples to create PRPS.
To create PRPS, we first need to identify the sources of unwanted variation, which we call batches in the data. Then the gene expression measurements of suitable biologically homogeneous sets of samples are averaged within batches, and the results called pseudo-samples. Since the variation between pseudo-samples in different batches is mainly unwanted variation, by defining them as pseudo-replicates and used them in RUV-III as replicates, we can easily and effectively remove the unwanted variation. we refer to our paper for more technical details R.Molania, bioRxiv, 2021.
Here, we use the TCGA invasive breast cancer (BRCA) RNA-seq data as an example to show how to remove tumour purity, flow cell chemistry, library size and batch effects (plate effects) from the data. We illustrate the value of our approach by comparing it to the standard TCGA normalizations on the TCGA BRCA RNA-seq data. Further, we demonstrate how unwanted variation can compromise several downstream analyses and can lead to wrong biological conclusions. We will also assess the performance of RUV-III with poorly chosen PRPS and in situations where biological labels are only partially known.
Note that RUV-III with PRPS is not limited to TCGA data: it can be used for any large genomics project involving multiple labs, technicians, platforms, …
1.1 Data preparation
The TCGA consortium aligned RNA sequencing reads to the hg38 reference genome using the STAR aligner and quantified the results at gene level using the HTseq and Gencode v22 gene-annotation Ref. The TCGA RNA-seq data are publicly available in three formats: raw counts, FPKM and FPKM with upper-quartile normalization (FPKM.UQ). All these formats for individual cancer types (33 cancer types, ~ 11000 samples) were downloaded using the TCGAbiolinks R/Bioconductor package (version 2.16.1). The TCGA normalized microarray gene expression data were downloaded from the Broad GDAC Firehose repository , data version 2016/01/28. Tissue source sites (TSS), and batches of sequencing-plates were extracted from individual TCGA patient barcodes, and sample processing times were downloaded from the MD Anderson Cancer Centre TCGA Batch Effects website. Pathological features of cancer patients were downloaded from the Broad GDAC Firehose repository (https://gdac.broadinstitute.org). The details of processing the TCGA BRCA RNA-seq samples using two flow cell chemistries were received by personal communication from Dr. K Hoadley. The TCGA survival data reported by Liu et al. were used in this paper. The consensus measurement of purity estimation (CPE) were downloaded from the Aran et al study.
We have generated SummarizedExperiment objects for all the TCGA RNA-seq datasets. These datasets can be found here TCGA_PanCancerRNAseq. Unwanted variation of all the datasets can be explored using an Rshiny application published in (R.Molania, bioRxiv, 2021).
All datasets that are required for this vignette can be found here link
2 TCGA BRCA gene expression data
2.1 RNA-seq data
We load the TCGA_SummarizedExperiment_HTseq_BRCA.rds file. This is a SummarizedExperiment object that contains:
assays:
-Raw counts
-FPKM
-FPKM.UQ
colData:
-Batch information
-Clinical information (collected from different resources)
rowData:
-Genes’ details (GC, chromosome, …)
-Several lists of housekeeping genes
The lists of housekeeping genes might be suitable to use as negative control genes (NCG) for the RUV-III normalization.
source('Libraries_HelperFunctions_ForBRCAVignette.R')
brca.se <- readRDS(
'../TCGA_SummarizedExperiment_HTseq_BRCA.rds'
) # 56493 1222
all.samples <- colnames(brca.se)
n.cores <- 52.1.1 Removing genes and plates or samples
Here, we explain what kind of genes and samples are removed before any down-stream analysis.
2.1.1.1 Lowly expressed genes
We identify lowly expressed genes in cancer and normal samples separately. Genes with at least 15 raw counts in at least 10% of cancer and normal samples are retained for down-stream analyses. These details can be found in the “keep.cancer” and ‘keep.normal’ columns of the gene annotation file in the SummarizedExperiment object.
normal.tissues <- SummarizedExperiment::colData(
brca.se
)$tissue == 'normal'
cancer.tissues <- SummarizedExperiment::colData(
brca.se
)$tissue == 'cancer'
keep.genes.normal <- apply(
SummarizedExperiment::assay(brca.se[ , normal.tissues], 'HTseq_counts'),
1,
function(x) length(x[x > 15]) >= round(.11*sum(normal.tissues), digits = 0)
)
keep.genes.normal <- names(keep.genes.normal[keep.genes.normal == TRUE])
keep.genes.cancer <- apply(
SummarizedExperiment::assay(brca.se[ , cancer.tissues], 'HTseq_counts'),
1,
function(x) length(x[x > 15]) >= round(.1*sum(cancer.tissues), digits = 0)
)
keep.genes.cancer <- names(keep.genes.cancer[keep.genes.cancer == TRUE])2.1.1.2 Keep protein conding genes
We also keep only protein coding genes. This detail can be found in the “gene_type” column in the gene annotation file. This is an arbitrary filtering, any gene_type of interest can be retained in the data.
proteinCoding.index <- as.data.frame(
SummarizedExperiment::rowData(brca.se)
)$gene_type. == 'protein.coding'
keep.proteinCoding <- as.data.frame(
SummarizedExperiment::rowData(brca.se)
)$gene_id.v[proteinCoding.index]
selected.genes <- intersect(
unique(c(keep.genes.normal, keep.genes.cancer)),
keep.proteinCoding
)
SummarizedExperiment::rowData(brca.se)$selected.genes <- 'remove'
SummarizedExperiment::rowData(brca.se)$selected.genes[
SummarizedExperiment::rowData(brca.se)$gene_id.v %in% selected.genes
] <- 'keep'2.1.1.3 Remove genes that have no or duplicated ENTREZ or gene symbol ids
Further, we remove genes that have no or duplicated ENTREZ gene id or gene symbol.
### entrez ids
SummarizedExperiment::rowData(brca.se)$entrezgene.use <- 'keep'
na.duplicated <- is.na(SummarizedExperiment::rowData(brca.se)$entrezgene_id_BioMart) |
duplicated(SummarizedExperiment::rowData(brca.se)$entrezgene_id_BioMart)
SummarizedExperiment::rowData(brca.se)$entrezgene.use[na.duplicated] <- 'remove'
### gene symbol
SummarizedExperiment::rowData(brca.se)$geneName.use <- 'keep'
na.duplicated <- is.na(SummarizedExperiment::rowData(brca.se)$gene_name.) |
duplicated(SummarizedExperiment::rowData(brca.se)$gene_name.)
SummarizedExperiment::rowData(brca.se)$geneName.use[na.duplicated] <- 'remove'
keep.genes <- SummarizedExperiment::rowData(brca.se)$entrezgene.use == 'keep' &
SummarizedExperiment::rowData(brca.se)$geneName.use == 'keep' &
SummarizedExperiment::rowData(brca.se)$selected.genes == 'keep'
brca.se <- brca.se[ keep.genes, ] # 16537 12222.1.1.4 Keep plates with at least 2 samples
The BRCA RNA-seq study involved 1180 assays generated using 38 plates over 5 years (R.Molania, bioRxiv, 2021). We keep plates with at least three samples for down-stream analyses.
keep.plates <- names(which(table(brca.se$plate_RNAseq) > 2))
keep.plates <- brca.se$plate_RNAseq %in% keep.plates
brca.se <- brca.se[ , keep.plates]2.1.2 PAM50 subtypes
Breast cancers intrinsic subtypes including HER2-enriched, Basal-like, Luminal A, Luminal B, and Normal-like, are based on a 50-gene expression signatures.
Here, we add the PAM50 subtypes annotations obtained from the TCGA BRCA research network. Later, we will use different approaches to identify the PAM50 subtypes in the TCGA BRCA RNA-seq data.
brca.pam50Calls <- read.delim('tcga.brca.pam50.calls.ucf.txt')
row.names(brca.pam50Calls) <- brca.pam50Calls$Barcode
### Remove samples
brca.pam50Calls <- droplevels(
brca.pam50Calls[brca.pam50Calls$Use == 'YES' , ]
)
### common samples
common.samples <- intersect(
colnames(brca.se),
brca.pam50Calls$Barcode
)
brca.pam50Calls <- brca.pam50Calls[common.samples , ]
brca.se <- brca.se[ , common.samples]
col.names <- colnames(brca.pam50Calls)
for(i in 1:length(col.names)){
SummarizedExperiment::colData(brca.se)[ , col.names[i]] <- brca.pam50Calls[ , i]
}
remove.sample <- which(
SummarizedExperiment::colData(brca.se)$Tissue.Type == 'adjacent normal' &
SummarizedExperiment::colData(brca.se)$Call == 'LumA'
)
brca.se <- brca.se[ , -remove.sample]
normal.samples <- SummarizedExperiment::colData(
brca.se
)$Tissue.Type == 'adjacent normal'
SummarizedExperiment::colData(
brca.se
)$Call[normal.samples] <- 'Adjacent normal'
SummarizedExperiment::colData(
brca.se
)$Call[SummarizedExperiment::colData(
brca.se
)$Call == 'Normal'] <- 'Normal like'
SummarizedExperiment::colData(brca.se)$Call <- factor(
SummarizedExperiment::colData(brca.se)$Call,
levels = c(
'Adjacent normal' ,
'Basal',
'Her2',
'LumA',
'LumB',
'Normal like'))2.1.3 Library size (sequencing-depth)
After removing genes and samples, we compute library size (total counts) and add this to the SummarizedExperiment object. We will use log2 of library size for all down-stream analyses.
brca.se$libSize <- log2(
Matrix::colSums(SummarizedExperiment::assay(brca.se, 'HTseq_counts')))2.1.4 Flow cell chemistry
The TCGA BRCA RNA-seq samples collected in 2010:2011 were profiled using one flow cell chemistry, and the remaining samples were profiled using a different flow cell chemistry (personal communication from TCGA). We add this detail to the the SummarizedExperiment object.
SummarizedExperiment::colData(brca.se)$FcCh <- '2012:2014'
SummarizedExperiment::colData(brca.se)$FcCh[
SummarizedExperiment::colData(brca.se)$year_mda < 2012 ] <- '2010:2011'
SummarizedExperiment::colData(brca.se)$FcCh <- factor(
SummarizedExperiment::colData(brca.se)$FcCh,
levels = c('2010:2011', '2012:2014'))2.1.5 Major biological groups
We need to identify major gene expression-based biological populations in order to create pseudo-samples (R.Molania, bioRxiv, 2021) and assess the performance of different normalization methods in the separation of these subtypes. As we have mentioned above, the PAM50 subtypes are well-known in the breast cancer gene expression data. Here, we adopt different approaches to identify the PAM50 subtypes in the TCGA RNA-Seq data.
2.1.5.1 The genefu R package
We use the function molecular.subtyping() in the genefu R/Bioconductor package to identify the PAM50 subtypes in the raw counts, FPKM and FPKM.UQ datasets.
tcga.harmonized <- names(
SummarizedExperiment::assays(brca.se)
)
row.names(brca.se) <- SummarizedExperiment::rowData(brca.se)$gene_name.
pam50.geneFu <- lapply(
tcga.harmonized,
function(x){
data <- log2(as.matrix(SummarizedExperiment::assay(
brca.se, x)) + 1)
.pam50.geneFu(
expr.data = data,
gene.annot = as.data.frame(
SummarizedExperiment::rowData(brca.se)))
})
names(pam50.geneFu) <- tcga.harmonized
### Add to sample annotation
col.names <- c(
'pam50.geneFu.raw',
'pam50.geneFu.fpkm',
'pam50.geneFu.fpkmUq')
for(i in 1:3){
pam50.geneFu[[i]]$subtype <-
as.character(pam50.geneFu[[i]]$subtype )
index.a <- pam50.geneFu[[i]]$subtype == 'Normal'
pam50.geneFu[[i]]$subtype[index.a] <- 'Normal like'
index.b <- brca.se$Tissue.Type == 'adjacent normal'
pam50.geneFu[[i]]$subtype[index.b] <- 'Adjacent normal'
SummarizedExperiment::colData(brca.se)[ , col.names[i]] <-
factor(
x = pam50.geneFu[[i]]$subtype,
levels = c(
"Adjacent normal",
"Basal",
"Her2",
"LumA",
"LumB",
"Normal like"))}2.1.5.2 Picornell et.al’s algorithm
Here, we use a PAM50 classifier proposed by A.C.Picornell et.al to identify the PAM50 subtypes in the data.
## [1] "Number of genes used: 50"
## [1] "Number of genes used: 50"
## [1] "Number of genes used: 50"
We also, run this classifier on the ER-balanced dataset. We use the ER estimates = 1.4 to divide samples into ER positive and negative groups and the calculate the calibration (median normalization) factors.
## [1] "Number of genes used: 50"
## [1] "Number of genes used: 50"
## [1] "Number of genes used: 50"
2.2 Microarray data
The TCGA BRCA microarray gene expression data were manually downloaded from the TCGA firehose website. The data version 2016_01_28. This data will be used as orthogonal platform to assess the performance of different RNA-seq normalizations.
brca.microarray <- read.delim(
'../BRCA.transcriptome__agilentg4502a_07_3__unc_edu__Level_3__unc_lowess_normalization_gene_level__data.data.txt',
row.names = 1,
stringsAsFactors = FALSE
)
brca.microarray <- brca.microarray[-1 , ]
row.labels <- row.names(brca.microarray)
brca.microarray <- apply(
brca.microarray,
2,
function(x) round(as.numeric(x), 3))
row.names(brca.microarray) <- row.labels
brca.microarray <- brca.microarray[
complete.cases(brca.microarray) , ]
colnames(brca.microarray) <- gsub(
'\\.',
'-',
colnames(brca.microarray))
### common samples with RNAseq
common.samples <- intersect(
all.samples,
colnames(brca.microarray)
) # 580
brca.microarray <- brca.microarray[ , common.samples]
normal.tissue.micor <- grep(
'11A',
colnames(brca.microarray)
)
brca.microarray <- brca.microarray[ , -normal.tissue.micor] # 17280 5253 The TCGA BRCA study outline
The BRCA RNA-seq study involved 1180 assays (after filtration as above) that were carried out on samples from 40 tissue source sites (TSS), distributed across 38 plates, and profiled over five years from 2010 to 2014 (Figure 3.1). The samples collected in 2010 and 2011 were profiled using one flow cell chemistry, and the remaining samples were profiled using a different flow cell chemistry (personal communication from TCGA). There were 94 adjacent normal breast tissue samples and 7 paired primary-metastatic samples in the study.
# brca.se <- readRDS('../Vig_brca.se.rds')
selected.columns <- c(
'year_mda',
'plate_RNAseq',
'tss_RNAseq',
'libSize',
'purity_HTseq_FPKM',
'Tissue.Type',
'pam50.geneFu.fpkm',
'Call',
'FcCh')
sample.info <- SummarizedExperiment::colData(
brca.se)[ , selected.columns]
### Year
H.year <- ComplexHeatmap::Heatmap(
rev(sample.info$year_mda),
cluster_columns = FALSE,
column_names_gp = grid::gpar(fontsize = 12),
col = years.colors,
name = 'Time (years)',
heatmap_legend_param = list(
color_bar = "discrete" ,
ncol = 2,
title_gp = grid::gpar(fontsize = 12)))
### Plates
n.plate <- length(unique(sample.info$plate_RNAseq)) # 38
colfunc <- grDevices::colorRampPalette(
RColorBrewer::brewer.pal(11, 'PRGn')[-6])
color.plates <- colfunc(n.plate)
H.plate <- ComplexHeatmap::Heatmap(
rev(sample.info$plate_RNAseq),
cluster_rows = FALSE,
cluster_columns = FALSE,
column_names_gp = grid::gpar(fontsize = 12),
col = color.plates,
name = 'Plates',
heatmap_legend_param = list(
color_bar = "discrete" ,
ncol = 4,
title_gp = grid::gpar(fontsize = 12)))
### TSS
n.tss <- length(unique(sample.info$tss_RNAseq)) # 40
colfunc <- grDevices::colorRampPalette(
RColorBrewer::brewer.pal(11, 'BrBG')[-6]
)
color.tss <- colfunc(n.tss)
H.tss <- ComplexHeatmap::Heatmap(
rev(sample.info$tss_RNAseq),
cluster_rows = FALSE,
cluster_columns = FALSE,
column_names_gp = grid::gpar(fontsize = 12),
col = color.tss,
name = 'Tissue source sites',
heatmap_legend_param = list(
color_bar = "discrete" ,
ncol = 4,
title_gp = grid::gpar(fontsize = 12)))
### Tissue
H.tissue <- ComplexHeatmap::Heatmap(
rev(sample.info$Tissue.Type),
cluster_rows = FALSE,
column_names_gp = grid::gpar(fontsize = 12),
col = c("#252525", 'blue', "#D9D9D9"),
name = 'Tissues',
heatmap_legend_param = list(
color_bar = "discrete" ,
direction = "vertical",
ncol = 1,
title_gp = grid::gpar(fontsize = 12),
labels = c(
'Primary tumor',
'Metastatic tumor',
'Adjacent normal')))
### Purity
H.purity <- ComplexHeatmap::Heatmap(
rev(sample.info$purity_HTseq_FPKM),
column_names_gp = grid::gpar(fontsize = 12),
cluster_rows = FALSE,
name = 'Tumor purity score',
col = viridis::plasma(n = 10),
heatmap_legend_param = list(
title_gp = grid::gpar(fontsize = 12)))
### library size
H.ls <- ComplexHeatmap::Heatmap(
rev(sample.info$libSize),
cluster_rows = FALSE,
name = 'Library size',
column_names_gp = grid::gpar(fontsize = 12),
col = viridis::viridis(n = 10),
heatmap_legend_param = list(
title_gp = grid::gpar(fontsize = 12)))
### PAM50
H.pam50.tcga <- ComplexHeatmap::Heatmap(
rev(sample.info$Call),
cluster_rows = FALSE,
name = 'PAM50 (TCGA calls)',
column_names_gp = grid::gpar(fontsize = 12),
col = pam50.colors,
heatmap_legend_param = list(
title_gp = grid::gpar(fontsize = 12)))
### PAM50 genefu
H.pam50.genefu <- ComplexHeatmap::Heatmap(
rev(sample.info$pam50.geneFu.fpkm),
cluster_rows = FALSE,
name = 'PAM50 (Genefu calls)',
column_names_gp = grid::gpar(fontsize = 12),
col = pam50.colors,
heatmap_legend_param = list(
title_gp = grid::gpar(fontsize = 12)))
### Flow cell chemistry
H.fcch <- ComplexHeatmap::Heatmap(
rev(sample.info$FcCh),
cluster_rows = FALSE,
name = 'Flow cell chemistry',
column_names_gp = grid::gpar(fontsize = 12),
col = FcCh.colors,
heatmap_legend_param = list(
title_gp = grid::gpar(fontsize = 12),
direction = "horizontal"))
ComplexHeatmap::draw(
H.year +
H.fcch +
H.plate +
H.tss +
H.tissue +
H.pam50.tcga +
H.pam50.genefu +
H.ls +
H.purity,
merge_legends = FALSE,
heatmap_legend_side = 'right')Figure 3.1: Study outline of the TCGA BRCA RNA-seq data.
4 Laser capture microdissection gene expression data
In this vignette, we will show how to use RUV-III with PRPS to remove or reduce tumor purity variation from the TCGA RNA-seq data. We use breast cancer laser capture microdissection (LCM) gene expression data to be able to assess the performance of RUV-III-PRPS in terms of removing tumour purity. We are not able to use the TCGA breast cancer microarray gene expression data as this data is being affected by tumor purity as well.
Here, we download the breast cancer laser capture microdissection gene expression data from Toro AL et.al study.
gse78958 <- GEOquery::getGEO('GSE78958')
expr.78958 <- as.data.frame(Biobase::exprs(gse78958[[1]])) # 22277 424
sampleAnnot.78958 <- Biobase::pData(gse78958[[1]]) # 424 40
gene.annot.78958 <- as.data.frame(
gse78958[[1]]@featureData@data
)
### unique ids
enterz <- sapply(
strsplit(gene.annot.78958$ENTREZ_GENE_ID, '///'),
length
)
gene.symbol <- sapply(
strsplit(gene.annot.78958$`Gene Symbol`, '///'),
length
)
unique.ids <- intersect(
which(enterz < 2),
which(gene.symbol < 2)
)
gene.annot.78958 <- gene.annot.78958[unique.ids, ]
gene.annot.78958$unique.ids <- paste(
gene.annot.78958$ENTREZ_GENE_ID,
gene.annot.78958$`Gene Symbol`,
sep = '_')
gene.annot.78958 <- gene.annot.78958[gene.annot.78958$unique.ids != "_", ] # 19820 17
expr.78958 <- expr.78958[gene.annot.78958$ID , ] # 19820 424
expr.78958$gene <- as.factor(
gene.annot.78958$unique.ids)
### Averaging probes for individual genes
expr.78958 <- data.table::as.data.table(expr.78958)
expr.78958 <- as.data.frame(
expr.78958[ , lapply(.SD, mean),
'gene']
) # 12549 425
row.names(expr.78958) <- expr.78958$gene
expr.78958 <- expr.78958[, -1]
expr.78958 <- as.matrix(expr.78958)
gene.annot.78958 <- gene.annot.78958[
!duplicated(gene.annot.78958$unique.ids) , ] # 21666 17
expr.78958 <- expr.78958[
match(
gene.annot.78958$unique.ids,
row.names(expr.78958)
) , ] # 12549 4245 RUV-III normalization
Here, we apply the RUV-III with PRPS normalization on the TCGA BRCA RNA-seq data. The RUV-III method makes essential use of negative control genes and technical replicates to estimate unwanted variation and remove it from the data (R.Molania, NAR, 2019).
Here, we will explain how to select a suitable set of negative control genes and pseud-replicates of pseudo-samples to remove tumour purity variation, library size variation, flow cell chemistry and an unknown source of unwanted variation from the TCGA BRCA RNA-seq data.
5.1 Selection of negative control genes (NCG)
First, we need to highlight that in our usage, a negative control gene is one that is not expected to change much across biological factors of interests(R.Molania, NAR, 2019). Second, in the presence of unwanted variation usually a subset of genes are affected in different ways, so we need to emphasize that our approach to negative controls is pragmatic: if using a given gene in RUV-III as one of the set of negative control genes helps, as indicated by various measures, then whether or not it is an ideal negative control gene is moot: it helped. This does not rule out the fact that we may be able to do better by replacing that gene with a different gene designated a negative control. Third, we point out that theoretical analyses not presented here show that it is not the extent to which individual genes designated as negative controls are ideal or less than ideal negative controls that drives the success or otherwise of RUV-III; that is a property of the full set of negative controls. We will frequently get very good results using the entire set of genes being studied as negative controls, even when many genes are changing. (That this is not unreasonable follow from the theory just mentioned.) Fourth, it is usually the case that using more genes as negative control genes is better than using fewer. That is a matter of stability, but as stated in the introduction, there is a bias-variance trade-off here: using too many genes as negative controls may be counter-productive. You must look and see. Fifth, endogenous genes generally make more suitable negative controls than spike-ins. The reason here is the obvious one, namely, that endogenous genes have shared the complete sample experience of the other genes, whereas spike-ins can only reflect unwanted variation in the process from the point at which they were added onward. The best source of endogenous negative control genes are ones that were found to be stable in previous studies similar to the one being analyzed.
Here, we explain an approach the we used in our paper to select a suitable set of negative control genes for RUV-III normalization of the TCGA BRCA RNA-seq data.
First, we select samples that they have the same PAM50 subtypes obtained by applying the geneFu classifier using the TCGA FPKM and FPKM.UQ data. We found 981 samples that meet this criteria.
index.cancer <- brca.se$tissue == 'cancer'
brca.cancer.se <- brca.se[ , index.cancer]
brca.geneAnnot.ncg <- SummarizedExperiment::rowData(brca.cancer.se)
brca.sampleAnnot.cancer <- as.data.frame(
SummarizedExperiment::colData(brca.cancer.se)
)
brca.fpkmUq.cancer <- SummarizedExperiment::assay(
brca.cancer.se,
'HTseq_FPKM.UQ'
)
brca.rawCounts.cancer <- SummarizedExperiment::assay(
brca.cancer.se,
'HTseq_counts'
)
### consensus PAM50 calls
cols <- c(
'pam50.geneFu.fpkm',
'pam50.geneFu.fpkmUq'
)
consensus.pam50.calls <- apply(
brca.sampleAnnot.cancer[ , cols, drop = FALSE],
1,
function(x) length(unique(x))
)
brca.sampleAnnot.cancer$pam50.consensus <- consensus.pam50.calls
samples.to.use <- brca.sampleAnnot.cancer$pam50.consensus == 1
ncg.sample.info <- droplevels(
brca.sampleAnnot.cancer[samples.to.use, ]
)
ncg.sample.info$pam50 <- ncg.sample.info$pam50.geneFu.fpkm
ncg.data <- brca.fpkmUq.cancer[ , samples.to.use]5.1.1 ANOVA between gene expression and the PAM50 subtypes
Then, we apply ANOVA on individual genes expression with the PAM50 being a factor within the first batch of flow cell chemistry. The F-statistics are ranked and saved in the gene annotation file. This analysis helps to identify genes that are not highly affected by the PAM50 subtypes.
index.fcch <- ncg.sample.info$FcCh == levels(ncg.sample.info$FcCh)[1]
ftest.pam50.FcCh <- ftest.pam50 <- .Ftest(
data = ncg.data[ , index.fcch],
variable = ncg.sample.info$pam50[index.fcch],
is.log = FALSE,
n.cores = n.cores
)
brca.geneAnnot.ncg$pam50.de.2010.11 <- rank(
ftest.pam50.FcCh$FValue
)5.1.2 ANOVA between gene expression and flow cell chemistry
To identify genes that are highly affected by the flow cell chemistry effect, we perform ANOVA on individual genes expression with the flow cell chemistry being a factor within individual PAM50 subtypes.
ftest.FcCh.PerPam50 <- lapply(
levels(ncg.sample.info$pam50)[1:4],
function(x){
index <- ncg.sample.info$pam50.erBalanced.FPKM.UQ == x
.Ftest(
data = ncg.data[, index],
variable = ncg.sample.info$FcCh[index],
is.log = FALSE,
n.cores = n.cores
)
})
names(ftest.FcCh.PerPam50) <- paste0(
'FcCh_',
levels(ncg.sample.info$pam50)[1:4]
)
for(i in names(ftest.FcCh.PerPam50) ){
index.na <- is.na(ftest.FcCh.PerPam50[[i]]$FValue)
ftest.FcCh.PerPam50[[i]]$FValue[index.na] <-
mean(ftest.FcCh.PerPam50[[i]]$FValue, na.rm = TRUE)
brca.geneAnnot.ncg[, i] <- rank(-ftest.FcCh.PerPam50[[i]]$FValue)
}5.1.3 Correlation between gene expression and tumour purity
Here, we perform Spearman correlation analysis between individual gene expression and tumor purity variation within each PAM50 sutypes. This analysis helps us to identify genes that are affected by tumor purity.
ncg.sample.info$fch.pam50 <- paste(
ncg.sample.info$pam50,
ncg.sample.info$FcCh,
sep = '_'
)
corr.gene.purity.per.pam50.fcch <- lapply(
unique(ncg.sample.info$fch.pam50)[1:4],
function(x){
index <- ncg.sample.info$fch.pam50 == x
.corr.gene.variable(
expr.data = ncg.data[, index],
variable = ncg.sample.info$purity_HTseq_FPKM[index],
is.log = FALSE,
method = 'spearman',
n.cores = n.cores,
group = 'purity'
)
})
names(corr.gene.purity.per.pam50.fcch) <- paste0(
'PurityCorr_',
unique(ncg.sample.info$fch.pam50)[1:4]
)
for(i in names(corr.gene.purity.per.pam50.fcch) ){
index.na <- is.na(corr.gene.purity.per.pam50.fcch[[i]]$purity_rho)
corr.gene.purity.per.pam50.fcch[[i]]$purity_rho[index.na] <-
mean(corr.gene.purity.per.pam50.fcch[[i]]$purity_rho, na.rm = TRUE)
brca.geneAnnot.ncg[, i] <- abs(corr.gene.purity.per.pam50.fcch[[i]]$purity_rho)
}5.1.4 Correlation between gene expression and library size
Finally, we perform Spearman correlation analysis between individual gene expression and library size to identify genes that are affected by library size. This analysis helps us to identify genes that are affected by library size.
corr.gene.ls.ncg <- .corr.gene.variable(
expr.data = brca.rawCounts.cancer,
variable = brca.sampleAnnot.cancer$libSize,
is.log = FALSE,
method = 'spearman',
n.cores = n.cores,
group = 'purity'
)
brca.geneAnnot.ncg$corr.ls <- corr.gene.ls.ncg$purity_rho5.1.5 Putting it all together
So, we select genes that: 1) are not highly affected by the PAM50 variation (major biological variation), 2) are highly affected by flow cell chemistries, 3) are highly correlated with tumour purity and 4) are highly correlated with library size. We use different cut-offs to selected different sets of NCGs and assessed the performance. We found that the NCGs set 6 (see the R code below) is better than the other sets. Then, we will use that NCGs set for RUV-III normalizations.
### FsCh
nGenes <- c(
1000,
2000,
4000,
6000,
8000,
10000,
12000
)
purity.corr <- .7
ls.corr <- .7
ncg.set.FsCh <- lapply(
nGenes,
function(x){
ncg.set <- brca.geneAnnot.ncg$pam50.de.2010.11 < x &
brca.geneAnnot.ncg$FcCh_Her2 < x &
brca.geneAnnot.ncg$FcCh_Basal < x &
brca.geneAnnot.ncg$FcCh_LumA < x &
brca.geneAnnot.ncg$FcCh_LumB < x
return(ncg.set)
})
ncg.set.purity <- lapply(
nGenes,
function(x){
ncg.set <- brca.geneAnnot.ncg$pam50.de.2010.11 < x &
brca.geneAnnot.ncg$`PurityCorr_LumB_2010:2011` > purity.corr &
brca.geneAnnot.ncg$`PurityCorr_LumA_2010:2011` > purity.corr &
brca.geneAnnot.ncg$`PurityCorr_Her2_2010:2011` > purity.corr &
brca.geneAnnot.ncg$`PurityCorr_Basal_2010:2011` > purity.corr
return(ncg.set)
})
ncg.set.ls <- lapply(
nGenes,
function(x){
ncg.set <- brca.geneAnnot.ncg$pam50.de.2010.11 < x &
brca.geneAnnot.ncg$corr.ls > ls.corr
return(ncg.set)
})
all.ncg.sets <- lapply(
c(1:length(nGenes)),
function(x){
ncg.set <- c(
brca.geneAnnot.ncg$gene_name.[ncg.set.ls[[x]]],
brca.geneAnnot.ncg$gene_name.[ncg.set.purity[[x]]],
brca.geneAnnot.ncg$gene_name.[ncg.set.FsCh[[x]]]
)
})5.1.6 Assessments of negative control genes
We perform PCA on the raw counts of the TCGA BRCA RNA-seq data using only the selected NCG to assess their performance. Ideally, they should capture tumor purity variation, flow cell chemistry and library size effects. Note that, the RUV-III method is robust to negative control genes, but not always.
Figure 5.1 shows the first three PCs on the selected NCG in the raw counts of the TCGA BRCA RNA-seq data. We do not see the PAM50 clusters, but we do see the flow cell chemistry effects.
# all.ncg.sets <- readRDS('../Vig_all.ncg.sets.rds')
pca.ncg <- .pca(
data = brca.rawCounts.cancer[
all.ncg.sets[[6]] , ],
is.log = FALSE
)
cols <- c(
'pam50.geneFu.fpkm',
'FcCh'
)
pam50.colors.b <- pam50.colors[2:6]
name.cols <- c(
'PAM50',
'Flow cell chemistry'
)
colors <- c(
'pam50.colors.b',
'FcCh.colors'
)
pp <- lapply(
c(1:2),
function(x){
p <- .scatter.density.pc(
pcs = pca.ncg$sing.val$u[, 1:3],
pc.var = pca.ncg$variation,
group.name = name.cols[x],
group = brca.sampleAnnot.cancer[ , cols[x]],
color = base::get(colors[x]),
strokeSize = .2,
pointSize = 2,
strokeColor = 'gray30',
alpha = .6,
title = 'Negative control genes'
)
p
})
do.call(
gridExtra::grid.arrange,
c(pp[[1]],
pp[[2]],
ncol = 4))Figure 5.1: PCA plots of the raw counts of the TCGA BRCA RNA-seq data using only the negative control genes (4514 genes).
Further, Figure 5.2 shows the relationship between the first and third PCs of the negative control genes with library size and tumour purity scores. These results show that the NCG are suitable for RUV-III normalization.
df <- data.frame(
pc1 = pca.ncg$sing.val$u[,1],
pc3 = pca.ncg$sing.val$u[,3],
LibSize = brca.cancer.se$libSize,
purity = brca.cancer.se$purity_HTseq_FPKM.UQ)
p1 <- ggplot(df, aes(x = LibSize, y = pc1)) +
geom_point(color = 'grey40', alpha = .5) +
xlab(expression(Log[2]~'library size')) +
ylab('PC1') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10))
p2 <- ggplot(df, aes(x = purity, y = pc3)) +
geom_point(color = 'grey40', alpha = .5) +
xlab('Tumour purity scores') +
ylab('PC3') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10)
)
gridExtra::grid.arrange(
p1,
p2,
ncol = 2)Figure 5.2: Scatter plots show relationship between the first and third PCs of the negative control genes with library size and tumour purity scores
5.2 Pseudo-replicates of pseudo-samples (PRPS)
As we have mentioned above, the RUV-III method also requires technical replicates to estimate unwanted variation (R.Molania, NAR, 2019).
Here, we propose a new approach, pseudo-replicates of pseudo-samples, for deploying RUV-III method to remove unwanted variation from the TCGA BRCA RNA-seq data. As we have mentioned above, this approach requires at least one roughly known biologically homogeneous subclass of samples shared across the sources of unwanted variation. Here, we use the samples with the consensus PAM50 subtypes (981 samples) to create different sets of PRPS.
Here, we create different sets of PRPS for library size, flow cell chemistries and tumor purity. Note that, we aim to remove tumour purity, flow cell chemistry and library size effects, then we need to create three sets of PRPS.
samples.to.use <- brca.sampleAnnot.cancer$pam50.consensus == 1
prps.brca <- .CreatePseudoSamplesForLsPurityBatch(
expr.data = brca.rawCounts.cancer[ , samples.to.use],
sample.info = droplevels(brca.sampleAnnot.cancer[samples.to.use , ]),
librarySize = 'libSize',
batch = 'PlateId_mda',
biology ='pam50.geneFu.fpkmUq',
purity = 'purity_HTseq_FPKM',
include.ls = TRUE,
include.purity = TRUE,
minSamplesPerBatchPS = 3,
minSamplesForPurityPerBiology = 12,
minSamplesForPurityPS = 3,
minSamplesForLibrarySizePerBatch = 12,
minSamplesForLibrarySizePS = 3)5.2.1 PRPS for tumour purity
For removing the effect of tumor purity in the TCGA BRCA RNA-seq data, we define sets of PRPS for each PAM50 subtype in addition to those that we will create for removing library size, flow cell chemistries and plate to plate variation. We perform this by selecting the samples with the 3 highest and the samples with the 3 lowest values of tumor purity within each PAM50 subtype. Then we create two pseudo-samples within each PAM50 subtype by averaging the gene expression values across each set of 3 high purity samples and each set of 3 low purity samples. Figure 5.3 shows tumour purity score of pseudo-samples created for removing tumour purity variation. These are 5 sets of PRPS (duplicate pseudo-sample pairs).
ps.purity.brca <- prps.brca$ps.purity
ps.purity <- lapply(
levels(brca.sampleAnnot.cancer$pam50.geneFu.fpkm)[2:6],
function(x){
info <- brca.sampleAnnot.cancer[brca.sampleAnnot.cancer$pam50.consensus == 1 , ]
info <- info[info$pam50.geneFu.fpkm == x , ]
purity <- sort(info$purity_HTseq_FPKM)
high <- mean(purity[1:3])
low <- mean(purity[c(c(length(purity)- 2):length(purity)) ])
all <- c(low, high)
all
})
names(ps.purity) <- levels(
brca.sampleAnnot.cancer$pam50.geneFu.fpkm
)[2:6]
ps.purity <- as.data.frame(ps.purity)
ps.purity$purity <- c('High', 'Low')
ps.purity <- ps.purity %>%
tidyr::pivot_longer(
-purity,
names_to = 'Pam50',
values_to = 'Pur') %>%
data.frame(.)
### Plot
ggplot(ps.purity, aes(x = purity, y = Pur)) +
geom_point(size = 2) +
geom_line(aes(x = as.numeric(as.factor(purity)), y = Pur)) +
xlab('Groups') +
ylim(c(.22, .7)) +
ylab('Tumour purity socre') +
facet_wrap(~Pam50, ncol = 5) +
theme(
axis.text.x = element_text(size = 8, hjust = 1),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
strip.text.y = element_text(size = 12))Figure 5.3: Tumour purity scores of pseudo-samples created for removing tumour purity variation.
5.2.2 PRPS for flow cell chemistry/plates
5.2.2.1 PRPS map
Figure 5.4 shows the distribution of the PAM50 subtypes across plates in the data. To create PS, we average gene expression of at least 3 samples with respect to the biological populations and plates. These sets of PRPS help to remove flow cell chemistry and plate effects.
ps.sample.info <- droplevels(
brca.sampleAnnot.cancer[brca.sampleAnnot.cancer$pam50.consensus == 1, ]
)
new.info <- ps.sample.info
new.info$new.batch <- paste0(
new.info$year_mda,
'_',
new.info$PlateId_mda
)
new.info$biololy <- paste0(
new.info$pam50.geneFu.fpkm,
'_',
new.info$msi.status)
df_count <- new.info %>%
dplyr::count(new.batch, biololy)
df_count$use <- 'unselected'
df_count$use[df_count$n > 2] <- 'Selected'
ggplot(df_count, aes(x = new.batch, y = biololy)) +
geom_count(aes(color = use)) +
geom_text(aes(
label = n,
hjust = 0.5,
vjust = 0.5
)) +
xlab('Years-plates') +
ylab('Biological groups') +
theme_bw()+
theme(
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 0),
axis.text.x = element_text(size = 10,angle = 45,hjust = 1),
axis.text.y = element_text(size = 12, angle = 45, hjust = 1),
legend.position = 'none')Figure 5.4: Plot showing the sample sizes of the major biological groups across plates in the TCGA BRCA RNA-seq data.
5.2.2.2 Library size of PRPS
Figure 5.5 shows the library size of the pseudo-samples of each pseudo-replicate sets across plates. as expected, they capture the flow cell chemistry that we aim to remove.
ps.samples <- base::strsplit(
x = colnames(prps.brca$ps.batch),
split = '_'
)
year <- lapply(
1:length(ps.samples),
function(x) {
index <- which(new.info$plate_RNAseq == ps.samples[[x]][2])
unique(new.info$year_mda[index])
})
year.plate <- sapply(
1:length(ps.samples),
function(x) paste(
year[x],
ps.samples[[x]][2],
sep = '_'
)
)
pam50 <- sapply(
ps.samples,
function(x) x[1]
)
ps <- data.frame(
bio = pam50,
year.plate = year.plate,
ls = log2(colSums(prps.brca$ps.batch))
)
ggplot(ps, aes(x = year.plate, y = ls)) +
geom_point(size = 2) +
geom_line(aes(x = as.numeric(as.factor(year.plate)), y = ls)) +
xlab('Years_plates') +
ylab(expression(Log[2]~'library size')) +
facet_grid(bio ~.) +
theme(
axis.text.x = element_text(size = 8, angle = 30, hjust = 1),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
strip.text.y = element_text(size = 12))Figure 5.5: Library size of the PRPS sets for removing flow cell chemistry and plate effects.
5.2.3 PRPS for library size per plate
To create PRPS for library size, we select plates that have at least 12 samples of a particular PAM50 subtypes, and then selected the samples with the 3 highest and the samples with the 3 lowest values of library size. Then we created two pseudo-samples within each PAM50 subtype per plate by averaging the gene expression values across each set of 3 high library size samples and each set of 3 low library size samples.
5.2.3.1 PRPS map
Figure 5.6 shows the distribution of the PAM50 subtypes across plates in the data. To create PS, we average gene expression of at least three samples with respect to the biological populations and plates. These sets of PRPS help to remove flow cell chemistry and plates effects.
ps.sample.info <- droplevels(
brca.sampleAnnot.cancer[brca.sampleAnnot.cancer$pam50.consensus == 1, ]
)
new.info <- ps.sample.info
new.info$new.batch <- paste0(
new.info$year_mda,
'_',
new.info$PlateId_mda
)
new.info$biololy <- paste0(
new.info$pam50.geneFu.fpkm,
'_',
new.info$msi.status)
df_count <- new.info %>%
dplyr::count(new.batch, biololy)
df_count$use <- 'unselected'
df_count$use[df_count$n > 11] <- 'Selected'
ggplot(df_count, aes(x = new.batch, y = biololy)) +
geom_count(aes(color = use)) +
geom_text(aes(
label = n,
hjust = 0.5,
vjust = 0.5
)) +
xlab('Years-plates') +
ylab('Biological groups') +
theme_bw()+
theme(
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 0),
axis.text.x = element_text(size = 10,angle = 45,hjust = 1),
axis.text.y = element_text(size = 12, angle = 45, hjust = 1),
legend.position = 'none')Figure 5.6: A plot showing plates that have at least 12 samples.
5.2.3.2 Library size of PRPS
Figure 5.7 shows the library size of the pseudo-samples of each pseudo-replicate sets within plates.
### ps for library size
ps.ls.brca <- prps.brca$ps.ls
ps.samples <- base::strsplit(
x = colnames(ps.ls.brca),
split = '_'
)
plates <- sapply(
ps.samples,
function(x) x[2]
)
time.years <- vector()
for(i in 1:length(plates)){
index <- grep(plates[i], brca.sampleAnnot.cancer$plate_RNAseq)
year <- unique(brca.sampleAnnot.cancer$year_mda[index])
time.years <- c(time.years, year)
}
colnames(ps.ls.brca) <- paste(
colnames(ps.ls.brca),
time.years,
sep = '_'
)
samples <- colnames(ps.ls.brca)
samples <- strsplit(
x = colnames(ps.ls.brca),
split = '_'
)
pam50 <- sapply(
samples,
function(x) x[1]
)
year.plate <- sapply(
samples,
function(x) paste(x[4], x[2], sep = '_')
)
ps <- data.frame(
bio = pam50,
year.plate = year.plate,
ls = log2(colSums(ps.ls.brca)))
### plot
ggplot(ps, aes(x = year.plate, y = ls)) +
geom_point(size = 3) +
geom_line() +
xlab('Years_plates') +
ylim(c(24, 27.5)) +
ylab(expression(Log[2]~'library size')) +
facet_grid(bio ~.) +
theme(
axis.text.x = element_text(size = 8, angle = 30, hjust = 1),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
strip.text.y = element_text(size = 12))Figure 5.7: Library sizes of pseudo-samples created for removing plate library sizes.
5.3 RUV-III-PRPS normalization
Here, we apply the RUV-III normalization with the different sets of PRPS and selected negative control genes. We refer to (R.Molania, bioRxiv, 2021), and (R.Molania, NAR, 2019) for more details about the RUV-III method.
colnames(prps.brca$ps.batch) <- unlist(lapply(
colnames(prps.brca$ps.batch),
function(x){
unlist(strsplit(x, '[_]'))[1]
}))
brca.ruv.input <- t(log2(cbind(
brca.rawCounts.cancer,
prps.brca$ps.ls,
prps.brca$ps.batch,
prps.brca$ps.purity
) + 1)) # 1252 16537
## replicate matrix
rep.matrix.ruv <- ruv::replicate.matrix(
row.names(brca.ruv.input)
) # 1252 1119
## ruv-iii normalization
negative.control.genes <- colnames(brca.ruv.input) %in% all.ncg.sets[[6]]
ruviii.norm <- RUV_III_PRPS(
Y = brca.ruv.input,
M = rep.matrix.ruv,
ctl = negative.control.genes,
k = 12,
eta = NULL,
return.info = TRUE)
ruviii.prps.norm <- t(ruviii.norm$newY[1:1086, ])5.3.1 PAM50 subtypes
Here, we use the genefu R package to identify the PAM50 subtypes in the RUV-III normalized data.
pam50.ruv <- .pam50.geneFu(
expr.data = ruviii.prps.norm,
gene.annot = as.data.frame(SummarizedExperiment::rowData(brca.se))
)
index.cancer <- which(brca.se$tissue == 'cancer')
brca.se$pam50Genefu.ruv <- 'Adjacent normal'
brca.se$pam50Genefu.ruv[index.cancer] <- as.character(pam50.ruv$subtype)
brca.se$pam50Genefu.ruv[brca.se$pam50Genefu.ruv == 'Normal'] <- 'Normal like'
brca.se$pam50Genefu.ruv <- factor(
brca.se$pam50Genefu.ruv,
levels = c(
'Basal',
'Her2',
'LumA',
'LumB',
'Normal like'))6 Statistical summaries
We perform a range of statistical tests on the TCGA raw counts, FPKM , FPKM.UQ and RUV-III normalized datasets. These statistical tests are divided into gene and global-level analyses. We create a new SummarizedExperiment object with all the datasets.
brca.cancer.se <- SummarizedExperiment::SummarizedExperiment(
assays = list(
HTseq_counts = log2(SummarizedExperiment::assay(
brca.se[ , index.cancer],
'HTseq_counts') + 1),
HTseq_FPKM = log2(SummarizedExperiment::assay(
brca.se[ , index.cancer],
'HTseq_FPKM') + 1),
HTseq_FPKM.UQ = log2(SummarizedExperiment::assay(
brca.se[ , index.cancer],
'HTseq_FPKM.UQ') + 1),
RUV_III = ruviii.prps.norm
),
colData = droplevels(S4Vectors::DataFrame(
SummarizedExperiment::colData(brca.se[ , index.cancer]))),
rowData = as.data.frame(
SummarizedExperiment::rowData(brca.se))
)
normalizations <- names(
SummarizedExperiment::assays(brca.cancer.se)
)
normalizations.names <- c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III')
pam50.levels <- levels(
brca.cancer.se$pam50.geneFu.fpkmUq)
genefu.calls <- c(
'pam50.geneFu.raw',
'pam50.geneFu.fpkm',
'pam50.geneFu.fpkmUq',
'pam50Genefu.ruv')6.1 Principal component analysis
Here, we perform principal component analysis (PCA) across all samples for the individual datasets. We also perform PCA within each PAM50 subtype obtained by the genefu R package for the individual datasets.
### across all samples
pca.all <- lapply(
normalizations,
function(x){
.pca(
data = as.matrix(
SummarizedExperiment::assay(brca.cancer.se, x)
),
is.log = TRUE)
})
names(pca.all) <- normalizations
### within PAM50 subtypes
pca.pam50Genefu <- lapply(
pam50.levels,
function(x){
pca.within <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se
)[ , genefu.calls[y]] == x
.pca(
data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[ , index],
normalizations[y])
),
is.log = TRUE)
})
names(pca.within) <- normalizations
pca.within
})
names(pca.pam50Genefu) <- pam50.levels6.2 Relative log expression (RLE)
RLE plots are used to reveal trends, temporal clustering and other non-random patterns resulting from unwanted variation in gene expression data Ref. To generate RLE plots, we first formed the log-ratio log[yig/yg] of the raw count yig for gene g in the sample labelled i relative to the median value yg of the counts for gene g taken across all samples. We then generated a boxplot from all the log ratios for sample i, and plotted all such boxplots along a line, where i varies in a meaningful order, usually sample processing date. An ideal RLE plot should have its medians centered around zero while its box widths, their interquartile ranges (IQR) should be similar in magnitude.
We compute RLE across all samples and within each PAM50 subtypes for individual datasets.
### across all samples
rle.all <- lapply(
normalizations,
function(x){
.rle.comp(
expr.data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se, x)
),
is.log = TRUE
)
})
names(rle.all) <- normalizations
### within PAM50 subtypes
rle.pam50Genefu <- lapply(
pam50.levels,
function(x){
rle.all <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se
)[ , genefu.calls[y]] == x
.rle.comp(
expr.data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[, index],
normalizations[y])
),
is.log = TRUE)
})
names(rle.all) <- tcga.harmonized
rle.all
})
names(rle.pam50Genefu) <- pam50.levels
# pca.rle <- list(
# pca.all = pca.all,
# pca.pam50Genefu = pca.pam50Genefu,
# rle.all = rle.all,
# rle.pam50Genefu = rle.pam50Genefu
# )
# saveRDS(pca.rle, 'Vig_pca.rle.rds')
# pca.rle <- readRDS('../Vig_pca.rle.rds')
# pca.all <- pca.rle$pca.all
# pca.pam50Genefu <- pca.rle$pca.pam50Genefu
# rle.all <- pca.rle$rle.all
# rle.pam50Genefu <- pca.rle$rle.pam50Genefu6.3 ANOVA
Analysis of variance (ANOVA) enables us to assess the effects of a given qualitative variable (which we call a factor) on gene expression measurements across any set of groups (labelled by the levels of the factor) under study. We use ANOVA F statistics to summarize the effects of a qualitative source of unwanted variation (e.g. batches) on the expression levels of individual genes, where genes having large F statistics are deemed to be affected by the unwanted variation.
We perform ANOVA between individual gene expression and the flow cell chemistry batches across all samples and within the PAM50 subtypes for the individual datasets.
6.3.1 Gene expression and flow cell chemistry
### across all samples
ftest.fcch.all <- lapply(
normalizations,
function(x){
.Ftest(
data = as.matrix(SummarizedExperiment::assay(
brca.cancer.se,
x)
),
variable = brca.cancer.se$FcCh,
is.log = TRUE,
n.cores = n.cores
)
})
names(ftest.fcch.all) <- normalizations
### within each PAM50
ftest.fcch.pam50Genefu <- lapply(
pam50.levels,
function(x){
ftest.fcch.pam50 <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se
)[ , genefu.calls[y]] == x
.Ftest(
data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[ , index],
y)),
variable = brca.cancer.se$FcCh[index],
is.log = TRUE,
n.cores = n.cores
)
})
names(ftest.fcch.pam50) <- normalizations
ftest.fcch.pam50
})
names(ftest.fcch.pam50Genefu) <- pam50.levels6.3.2 Gene expression and plates
We perform ANOVA between individual gene expression and plates across all samples and within the PAM50 subtypes for individual datasets.
### across all samples
ftest.plate.all <- lapply(
normalizations,
function(x){
.Ftest(
data = as.matrix(SummarizedExperiment::assay(brca.cancer.se, x)),
variable = brca.cancer.se$plate_RNAseq ,
is.log = TRUE,
n.cores = n.cores
)
})
names(ftest.plate.all) <- normalizations
### within PAM50
ftest.plate.pam50Genefu <- lapply(
pam50.levels,
function(x){
ftest.plate.pam50 <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
.Ftest(
data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[ , index],
normalizations[y])),
var = brca.cancer.se$plate_RNAseq[index],
is.log = TRUE,
n.cores = n.cores
)
})
names(ftest.plate.pam50) <- normalizations
ftest.plate.pam50
})
names(ftest.plate.pam50Genefu) <- pam50.levels6.4 Correlations
Here, we perform Spearman correlation between individual gene expression and different continuous variables including tumour purity and library size.
6.4.1 Gene expression and library size
We perform Spearman correlation between individual gene expression and library size across all samples and within the PAM50 subtypes.
### across all samples
corr.geneLs.all <- lapply(
normalizations,
function(x){
.corr.gene.variable(
expr.data = as.matrix(
SummarizedExperiment::assay(brca.cancer.se, x)
),
is.log = TRUE,
variable = brca.cancer.se$libSize,
method = 'spearman',
n.cores = n.cores,
group = 'ls'
)
})
names(corr.geneLs.all) <- normalizations
### within the PAM50
corr.geneLs.pam50Genefu <- lapply(
pam50.levels,
function(x){
corr.all <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
.corr.gene.variable(
expr.data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[, index],
normalizations[y])
),
is.log = TRUE,
variable = brca.cancer.se$libSize[index],
n.cores = n.cores,
method = 'spearman',
group = 'ls')
})
names(corr.all) <- normalizations
corr.all
})
names(corr.geneLs.pam50Genefu) <- pam50.levels6.4.2 Gene expression and purity
We perform Spearman correlation between individual gene expression and tumor purity across all samples and within the PAM50 subtypes.
## across all samples
corr.genePurity.all <- lapply(
normalizations,
function(x){
.corr.gene.variable(
expr.data = as.matrix(SummarizedExperiment::assay(brca.cancer.se, x)),
is.log = TRUE,
var = brca.cancer.se$purity_HTseq_FPKM,
method = 'spearman',
n.cores = n.cores,
group = 'purity'
)
})
names(corr.genePurity.all) <- normalizations
### within the PAM50
corr.genePurity.pam50Genefu <- lapply(
pam50.levels,
function(x){
corr.all <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
.corr.gene.variable(
expr.data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[, index],
normalizations[y])
),
is.log = TRUE,
variable = brca.cancer.se$purity_HTseq_FPKM.UQ[index],
n.cores = n.cores,
method = 'spearman',
group = 'purity')
})
names(corr.all) <- normalizations
corr.all
})
names(corr.genePurity.pam50Genefu) <- pam50.levels6.4.3 Gene expression and RLE medians
Because of the sensitivity of RLE plots to unwanted variation, we also examine the relationships between RLE medians with individual gene expression levels in the datasets. In the absence of any influence of unwanted variation in the data we should see no such associations.
We perform Spearman correlation between individual gene expression and median of the RLE obtained for each dataset.
### across all samples
corr.geneMedRLe.all <- lapply(
normalizations,
function(x){
.corr.gene.variable(
expr.data = as.matrix(SummarizedExperiment::assay(brca.cancer.se, x)),
is.log = TRUE,
var = rle.all[[x]]$rle.med,
method = 'spearman',
n.cores = n.cores,
group = 'RleMed'
)
})
names(corr.geneMedRLe.all) <- normalizations
### within the PAM50
corr.geneMedRLE.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
corr.all <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
.corr.gene.variable(
expr.data = as.matrix(
SummarizedExperiment::assay(
brca.cancer.se[, index],
normalizations[y])
),
is.log = TRUE,
method = 'spearman',
variable = rle.pam50Genefu.all[[x]][[y]]$rle.med,
n.cores = n.cores,
group = 'RleMed')
})
names(corr.all) <- normalizations
corr.all
})
names(corr.geneMedRLE.pam50Genefu.all) <- pam50.levels6.5 Pair-wise correlation
6.5.1 Tumour purity
6.5.1.1 Pair-wise correlation
Here, we select genes (~1200) that have high negative correlation with tumor purity (< -0.6) in the TCGA FPKM.UQ data. Then, we compute pair-wise correlation between all possible pairs of the genes.
### gene selection
purity.genes <- lapply(
levels(brca.cancer.se$pam50.geneFu.fpkmUq),
function(x){
index <- corr.genePurity.pam50Genefu[[x]]$HTseq_FPKM.UQ$purity_rho < -.6
corr.genePurity.pam50Genefu[[x]]$HTseq_FPKM.UQ$purity_genes[index]
})
purity.genes <- unique(unlist(purity.genes))
# length(purity.genes) # 1263 genes
### pair-wise correlation
pair.wise.purityGenes <- combn(
purity.genes,
2)
future::plan("multiprocess", workers = 5)
corr.genePurity.pairWise.all <- lapply(
normalizations[c(3:4)],
function(y){
t.data <- t(SummarizedExperiment::assay(
brca.cancer.se[purity.genes , ],
y))
corr.coef <- future.apply::future_lapply(
c(1:ncol(pair.wise.purityGenes)),
function(x) {
gene <- pair.wise.purityGenes[, x]
cor.test(
t.data[, gene[1]],
t.data[, gene[2]],
method = 'spearman')[[4]]
})
return(unlist(corr.coef))
})
names(corr.genePurity.pairWise.all) <- normalizations[c(3:4)]
future::plan('sequential')6.5.1.2 Partial pair-wise correlation
Partial correlation is used to estimate linear correlation between two variables while controlling for another variable. We compute the partial correlation between the expression levels of pairs of the genes (~1200) controlling for tumor purity using the pcor.test() function from the ppcor R package (version 1.1).
future::plan("multiprocess", workers = 5)
purity <- brca.cancer.se$purity_HTseq_FPKM.UQ
corr.genePurity.partialPairWise.all <- lapply(
normalizations[c(3:4)],
function(y){
t.data <- t(SummarizedExperiment::assay(
brca.cancer.se[purity.genes , ],
y))
corr.coef <- future.apply::future_lapply(
c(1:ncol(pair.wise.purityGenes)),
function(x) {
gene <- pair.wise.purityGenes[, x]
ppcor::pcor.test(
x = t.data[, gene[1]],
y = t.data[, gene[2]],
z = purity,
method = 'spearman')$estimate
})
return(unlist(corr.coef))
})
names(corr.genePurity.partialPairWise.all) <- normalizations[c(3:4)]
future::plan('sequential')6.5.2 Flow cell chimstry
6.5.2.1 Pair-wise correlation
Here, we perform gene-gene correlations between all possible pairs of ~1000 genes that are highly affected by flow cell chemistry.
batch.gene.signature <- ftest.fcch.all$HTseq_FPKM.UQ$FValue > 200
# sum(batch.gene.signature) # 1039
batch.gene.signature <- ftest.fcch.all$HTseq_FPKM.UQ$Genes[
batch.gene.signature]
pair.wise.fcchGenes <- combn(
batch.gene.signature,
2
)
## pair-wise
future::plan("multiprocess", workers = 8)
corr.geneBatchScore.pairWise.all <- lapply(
normalizations[c(3:4)],
function(y){
t.data <- t(SummarizedExperiment::assay(
brca.cancer.se[batch.gene.signature , ],
y))
corr.coef <- future.apply::future_lapply(
c(1:ncol(pair.wise.fcchGenes)),
function(x) {
gene <- pair.wise.fcchGenes[, x]
cor.test(
t.data[, gene[1]],
t.data[, gene[2]],
method = 'spearman')[[4]]
})
return(unlist(corr.coef))
})
names(corr.geneBatchScore.pairWise.all) <- normalizations[c(3:4)]
future::plan('sequential')6.6 Vector correlation
We use the Rozeboom squared vector correlation to quantify the strength of (linear) relationships between two sets of variables such as the first k PCs (i.e. 1≤k≤10) and dummy variables representing time, batches, plates, and biological variables. Not only does this quantity summarize the full set of canonical correlations, but it also reduces to the familiar R2 from multiple regression when one of the variable sets contains just one element.
6.6.1 Association between PCs and PAM50 subtypes
We perform vector correlation between the first 10 PCs components and the PAM50 subtypes for individual datasets.
cca.pam50Genefu.all <- lapply(
c(1:4),
function(x){
pcs <- pca.all[[x]]$sing.val$u
pam50Genefu.dummies <- fastDummies::dummy_cols(
SummarizedExperiment::colData(brca.cancer.se)[, genefu.calls[x] ]
)
pam50Genefu.dummies <- pam50Genefu.dummies[, c(2:ncol(pam50Genefu.dummies))]
sapply(
1:10,
function(y){
cca.pam50 <- stats::cancor(
x = pcs[, 1:y, drop = FALSE],
y = pam50Genefu.dummies)
1 - prod(1 - cca.pam50$cor^2)
})
})
names(cca.pam50Genefu.all) <- normalizations6.6.2 Association between PCs and flow cell chemistry
We perform vector correlation between the first 10 PCs and the flow cell chemistry batches across all samples and within the PAM50 subtypes for individual datasets.
### across all samples
fcch.dummies <- fastDummies::dummy_cols(brca.cancer.se$FcCh)
fcch.dummies <- fcch.dummies[, c(2:ncol(fcch.dummies))]
cca.fcch.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u
sapply(
1:10,
function(y){
cca.fcch <- stats::cancor(
x = pcs[, 1:y, drop = FALSE],
y = fcch.dummies)
1 - prod(1 - cca.fcch$cor^2)
})
})
names(cca.fcch.all) <- normalizations
### within the PAM50
cca.fcch.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
cca.fcch.pam50 <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
fcch.dummies <- fastDummies::dummy_cols(brca.cancer.se$FcCh[index])
fcch.dummies <- fcch.dummies[, c(2:ncol(fcch.dummies))]
pcs <- pca.pam50Genefu[[x]][[y]]$sing.val$u
sapply(
1:10,
function(z){
cca.fcch <- stats::cancor(
x = pcs[, 1:z, drop = FALSE],
y = fcch.dummies)
1 - prod(1 - cca.fcch$cor^2)
})
})
names(cca.fcch.pam50) <- normalizations
cca.fcch.pam50
})
names(cca.fcch.pam50Genefu.all) <-pam50.levels6.6.3 Association between PCs and plates
We perform vector correlation between the first 10 PCs and the plates for individual datasets.
### across all samples
plate.dummies <- fastDummies::dummy_cols(brca.cancer.se$plate_RNAseq)
plate.dummies <- plate.dummies[, c(2:ncol(plate.dummies))]
cca.plate.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u
sapply(
1:10,
function(y){
cca.fcch <- stats::cancor(
x = pcs[, 1:y, drop = FALSE],
y = plate.dummies)
1 - prod(1 - cca.fcch$cor^2)
})
})
names(cca.plate.all) <- normalizations
### within the PAM50
cca.plate.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
cca.plate.pam50 <- lapply(
c(1:3),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
plate.dummies <- fastDummies::dummy_cols(brca.cancer.se$plate_RNAseq[index])
plate.dummies <- plate.dummies[, c(2:ncol(plate.dummies))]
pcs <- pca.pam50Genefu[[x]][[y]]$sing.val$u
sapply(
1:10,
function(z){
cca.plate <- stats::cancor(
x = pcs[, 1:z, drop = FALSE],
y = plate.dummies)
1 - prod(1 - cca.plate$cor^2)
})
})
names(cca.plate.pam50) <- tcga.harmonized
cca.plate.pam50
})
names(cca.plate.pam50Genefu.all) <- pam50.levels6.7 Linear regression
R2 values of fitted linear models are used to quantity the strength of the (linear) relationships between a single quantitative source of unwanted variation such as sample (log) library size or tumor purity and global sample summary statistics such as the first k PC (1≤k≤10).
6.7.1 Association between PCs and library size
We perform linear regression between the first 10 PCs and the log2 of library size for individual datasets.
### across all samples
lreg.ls.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u
tcga.ls.rSquared <- sapply(
1:10,
function(y) {
lm.ls <- summary(lm(brca.cancer.se$libSize ~ pcs[, 1:y]))$r.squared
})
})
names(lreg.ls.all) <- normalizations
### within the PAM50
lreg.ls.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
ls.reg <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
pcs <- pca.pam50Genefu[[x]][[y]]$sing.val$u
tcga.pam50.ls <- sapply(
1:10,
function(z){
lm.ls <- summary(lm(brca.cancer.se$libSize[index] ~ pcs[, 1:z]))$r.squared
lm.ls
})
})
names(ls.reg) <- normalizations
ls.reg
})
names(lreg.ls.pam50Genefu.all) <- pam50.levels6.7.2 Association between PCs and tumour purity
We perform linear regression between the first 10 PCs and the tumour purity scores for individual datasets.
### across all samples
lreg.purity.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u
ls.rSquared <- sapply(
1:10,
function(y) {
lm.ls <- summary(lm(brca.cancer.se$purity_HTseq_FPKM.UQ ~ pcs[, 1:y]))$r.squared
})
})
names(lreg.purity.all) <- normalizations
### within the PAM50
lreg.purity.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
purity.reg <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[y]] == x
pcs <- pca.pam50Genefu[[x]][[y]]$sing.val$u
pam50.purity <- sapply(
1:5,
function(z){
lm.purity <- summary(lm(
brca.cancer.se$purity_HTseq_FPKM.UQ[index] ~ pcs[, 1:z])
)$r.squared
lm.purity
})
})
names(purity.reg) <- normalizations
purity.reg
})
names(lreg.purity.pam50Genefu.all) <- pam50.levels6.8 Differential expression analysis
Differential expression (DE) analysis were performed using the Wilcoxon signed-rank test with log2 transformed raw counts and normalized data. To evaluate the effects of the different sources of unwanted variation on the data, differential expression analyses were performed across batches. In the absence of any batch effects, the histogram of the resulting unadjusted p-values should be uniformly distributed.
6.8.1 DE analysis between samples with low and high library size
We divide samples into low and high library size based on the median of the library size and then we perform DE analysis between the groups.
brca.cancer.se$ls.status <- ifelse(
brca.cancer.se$libSize > median(brca.cancer.se$libSize),
'high',
'low'
)
de.ls.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
de <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se
)[ , genefu.calls[y]] == x
data <- as.matrix(SummarizedExperiment::assay(
brca.cancer.se[ , index],
normalizations[y])
)
.wilcoxon.test(
expr.data = data,
is.log = TRUE,
variable = brca.cancer.se$ls.status[index],
n.cores = n.cores)
})
names(de) <- normalizations
de
})
names(de.ls.pam50Genefu.all) <- pam50.levels6.8.2 DE analysis between samples with low and high tumour purity
We divide samples into low and high tumour purity based on the median of the tumour purity and then we perform DE analysis between the groups.
brca.cancer.se$purity.status <- ifelse(
brca.cancer.se$purity_HTseq_FPKM > median(brca.cancer.se$purity_HTseq_FPKM),
'high',
'low'
)
de.purity.pam50Genefu.all <- lapply(
pam50.levels,
function(x){
de <- lapply(
c(1:4),
function(y){
index <- SummarizedExperiment::colData(
brca.cancer.se
)[ , genefu.calls[y]] == x
data <- as.matrix(SummarizedExperiment::assay(
brca.cancer.se[ , index],
normalizations[y])
)
.wilcoxon.test(
expr.data = data,
is.log = TRUE,
variable = brca.cancer.se$purity.status[index],
n.cores = n.cores)
})
names(de) <- normalizations
de
})
names(de.purity.pam50Genefu.all) <- pam50.levels6.9 Silhouette coefficient analysis
We use Silhouette coefficients analysis to assess the separation of biological populations and batch effects. The silhouette function uses Euclidean distance to calculate both the similarity between one patient and the other patients in each cluster, and the separation between patients in different clusters. A better normalization method will lead to higher and lower silhouette coefficients for biological and batch labels respectively.We use the first three PCs to compute silhouette coefficients for individual datases.
6.9.1 Association between PCs and plates
### Cancer samples
silCoef.plate.all <- lapply(
normalizations,
function(x){
.silhouette.coeff(
pcs = pca.all[[x]]$sing.val$u,
variable = brca.cancer.se$plate_RNAseq,
nPCs = 3)
})
names(silCoef.plate.all) <- normalizations6.9.2 Association between PCs and flow cell chemistery
### Cancer samples
silCoef.fcch.all <- lapply(
normalizations,
function(x){
.silhouette.coeff(
pcs = pca.all[[x]]$sing.val$u,
variable = brca.cancer.se$FcCh,
nPCs = 3)
})
names(silCoef.fcch.all) <- normalizations6.9.3 Association between PCs and PAM50
silCoef.pam50Genefu.all <- lapply(
c(1:4),
function(x){
.silhouette.coeff(
pcs = pca.all[[x]]$sing.val$u,
variable = SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[x]],
nPCs = 3)
})
names(silCoef.pam50Genefu.all) <- normalizations6.10 ARI
The Adjusted Rand Index [57] is the corrected-for-chance version of the Rand Index. The ARI measures the percentage of matches between two label lists. We used the ARI to assess the performance of normalization methods in terms of sample subtypes separation and batch mixing. We first calculated principal components and used the first 3 PC to perform ARI.
6.10.1 Association between PCs and PAM50
set.seed(2012190737)
ari.pam50Genefu.all <- lapply(
c(1:4),
function(x){
pcs <- pca.all[[x]]$sing.val$u[,1:3]
BIC <- mclust::mclustBIC(data = pcs)
mod <- mclust::Mclust(data = pcs, x = BIC, G = 5)
mclust::adjustedRandIndex(
mod$classification,
SummarizedExperiment::colData(brca.cancer.se)[, genefu.calls[x]]
)
})
names(ari.pam50Genefu.all) <- normalizations6.10.2 Association between PCs and flow cell chemistry
set.seed(2012190737)
ari.fcch.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u[,1:3]
BIC <- mclust::mclustBIC(data = pcs)
mod <- mclust::Mclust(data = pcs, x = BIC, G = 2)
mclust::adjustedRandIndex(
mod$classification,
brca.cancer.se$FcCh
)
})
names(ari.fcch.all) <- normalizations6.11 Tmour purity estimates
We estimated tumor purity for all TCGA RNA-seq cancer samples using the stromal and immune gene signatures from Yoshihara et al and the R/Bioconductor package singscore version (1.12.0). The stromal&immune scores were transformed to 1–stromal&immune scores for down-stream analyses. These measurements are called tumor purity scores in this study. The tumor purity scores showed high positive correlation (mean = 0.95, Pearson correlation) with the ESTIMATE measurements¬ from Aran et al.
purity.gene.sig <-
SummarizedExperiment::rowData(brca.cancer.se)$stromal == 'yes' |
SummarizedExperiment::rowData(brca.cancer.se)$immnue == 'yes'
purity.estimates <- lapply(
normalizations[c(3:4)],
function(x){
rank.data <- singscore::rankGenes(
as.matrix(SummarizedExperiment::assay(
brca.cancer.se,
x)))
singscore::simpleScore(
rankData = rank.data,
upSet = row.names(brca.cancer.se)[purity.gene.sig],
centerScore = F
)$TotalScore
})
names(purity.estimates) <- c(
'FPKM.UQ',
'RUV-III')6.12 Unknown batches
An expression heatmap of the most highly affected genes by the flow cell chemistries reveals two clusters within the samples processed by the first flow cell chemistry (section 7.1.5). To explore this more fully, we take the set of most highly affected genes by the flow cell chemistries and scored samples against this gene set (hereafter called the batch score) using the R/Bioconductor package singscore on the FPKM.UQ normalized dataset.
6.12.1 Selection of genes
We select genes that have F-statistc higher than 300 obtained from ANOVA analysis berween genes and flow cell chemistry in the TCGA FPKM.UQ data.
### Batch gene signatures
fcch.gene.signature <- ftest.fcch.all$HTseq_FPKM.UQ$FValue > 300
fcch.gene.signature <- ftest.fcch.all$HTseq_FPKM.UQ$Genes[fcch.gene.signature]6.12.2 Scoring samples
Then we use the selected genes to score the TCGA FPKM.UQ and RUV-III normalized data.
fcch.scores <- lapply(
normalizations[c(3:4)],
function(x){
rank.data <- singscore::rankGenes(
as.matrix(SummarizedExperiment::assay(
brca.cancer.se,
x))
)
singscore::simpleScore(
rankData = rank.data,
upSet = fcch.gene.signature
)
})
names(fcch.scores) <- normalizations[c(3:4)]6.12.3 Clustering of the scores
Here, we use different cut-offs to divide samples into 4 groups (sub-batches) for down-stream analysis.
g1 <- fcch.scores$HTseq_FPKM.UQ$TotalScore < .025
g4 <- fcch.scores$HTseq_FPKM.UQ$TotalScore > .17
g3 <- fcch.scores$HTseq_FPKM.UQ$TotalScore <= .17 &
fcch.scores$HTseq_FPKM.UQ$TotalScore > .12
brca.cancer.se$sub.batches <- 'group2'
brca.cancer.se$sub.batches[g1] <- 'group1'
brca.cancer.se$sub.batches[g4] <- 'group4'
brca.cancer.se$sub.batches[g3] <- 'group3'
brca.cancer.se$sub.batches <- factor(
brca.cancer.se$sub.batches,
levels = c(
'group1',
'group2',
'group3',
'group4'))6.12.4 Correlation between genes and scores
Further, we perform Spearman correlation analysis between genes expression and the batch score to see how many genes are affected by this complex batch notion (flow cell chemistry and unknown sources).
corr.geneFcchScores <- lapply(
normalizations[c(3:4)],
function(x){
.corr.gene.variable(
expr.data = as.matrix(SummarizedExperiment::assay(brca.cancer.se, x)),
is.log = TRUE,
variable = fcch.scores[[x]]$TotalScore,
method = 'spearman',
n.cores = n.cores,
group = 'fcch.score')
})
names(corr.geneFcchScores) <- normalizations[c(3:4)]7 Normalization performance assessments
7.1 Tumour purity
As with most of the other TCGA RNA-seq studies (R.Molania, bioRxiv, 2021), tumor purity is one of the major sources of variation in the BRCA study. For this dataset, we designed our PRPS in order to remove the effects of tumor purity as well as other technical variation.
7.1.1 Linear regression
Figure 7.1 shows that the RUV-III normalization mitigate the tumor purity variation in the TCGA BRCA RNA-seq data. The FPKM and FPKM.UQ normalization method are not designed to remove tumor purity variation from cancer gene expression data.
lreg.purity <- as.data.frame(lreg.purity.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'r.sq') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III'))
)
ggplot(lreg.purity, aes(x = pcs, y = r.sq, group = datasets)) +
geom_line(aes(color = datasets)) +
geom_point(aes(color = datasets)) +
xlab('PCs') + ylab (expression('R'^'2')) +
scale_color_manual(values = c(dataSets.colors), name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 16, angle = 35, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 16),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))Figure 7.1: A plot showing the R-squared of linear regression between tumour purity and up to the first 10 principal components (taken cumulatively) for different normalization methods.
Figure 7.2 the linear regression between tumor purity and the first 10 PCs within each PAM50 subtype for differently normalized data.
pcs.purity.lnreg.pam50 <- lapply(
pam50.levels,
function(x){
r.seq <- lapply(
normalizations,
function(y){
lreg.purity.pam50Genefu.all[[x]][[y]]
})
names(r.seq) <- paste(normalizations, x, sep = '__')
do.call(cbind, r.seq)
})
names(pcs.purity.lnreg.pam50) <- pam50.levels
pcs.purity.lnreg.pam50 <- do.call(cbind, pcs.purity.lnreg.pam50) %>%
data.frame(.) %>%
dplyr::mutate(pcs = c(1:5)) %>%
tidyr::pivot_longer(
-c(pcs),
names_to = 'datasets',
values_to = 'r.sq') %>%
tidyr::separate(
col = datasets,
sep ='__',
into = c('datasets', 'pam50' )) %>%
data.frame(.) %>%
dplyr::mutate(
type = recode(
datasets,
'HTseq_counts' = 'Raw counts',
'HTseq_FPKM' = 'FPKM',
'HTseq_FPKM.UQ' ='FPKM.UQ',
'RUV_III' = 'RUV-III'
))
### plot
ggplot(pcs.purity.lnreg.pam50 , aes(x = pcs, y = r.sq)) +
geom_line(aes(color = type)) +
geom_point(aes(color = type)) +
xlab('') +
ylab (expression('R'^'2')) +
scale_color_manual(values = dataSets.colors, name = 'Datasets') +
scale_x_continuous(breaks = (1:5), labels = c('PC1', paste0('PC1:', 2:5)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
facet_wrap(~pam50 , ncol = 3) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 16),
axis.text.x = element_text(size = 10, angle = 35, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 16),
strip.text = element_text(size = 14))Figure 7.2: A plot showing the R-squared of linear regression between tumour purity and up to the first 5 principal components (taken cumulatively) within each PAM50 subtype.
7.1.2 Correlations analysis between gene expression and tumor purity
Figure 7.3 shows Spearman correlation coefficients between the gene expression levels and tumor purity within each PAM50 subtype in differently normalized data.The RUV-III normalization reduced the association between and expression and purity in the TCGA BRCA RNA-Seq data.
corr.genePurity.pam50Genefu <- lapply(
pam50.levels,
function(x){
r.seq <- lapply(
c(1:4),
function(y){
corr.genePurity.pam50Genefu[[x]][[y]]$purity_rho
})
names(r.seq) <- paste(normalizations, x, sep = '__')
do.call(cbind, r.seq)
})
names(corr.genePurity.pam50Genefu) <- pam50.levels
corr.genePurity.pam50Genefu <- do.call(
cbind,
corr.genePurity.pam50Genefu) %>%
data.frame(.) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'corr.coef') %>%
tidyr::separate(
col = datasets,
sep ='__',
into = c('datasets', 'pam50')) %>%
data.frame(.) %>%
dplyr::mutate(
type = recode(
datasets,
'HTseq_counts' = 'Raw counts',
'HTseq_FPKM' = 'FPKM',
'HTseq_FPKM.UQ' ='FPKM.UQ',
'RUV_III' = 'RUV-III'))
### Plot
ggplot(corr.genePurity.pam50Genefu, aes(x = pam50, y = corr.coef, fill = type)) +
geom_boxplot(outlier.color = 'white') +
ylab("Spearman correlation") +
xlab('PAM50 subtypes') +
scale_fill_manual(values = dataSets.colors) +
guides(fill = guide_legend("Datasets")) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 22),
axis.title.y = element_text(size = 22),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 16),
axis.text.y = element_text(size = 16),
legend.text = element_text(size = 14),
legend.title = element_text(size = 16),
strip.text.x = element_text(size = 10))Figure 7.3: Boxplots of Spearman correlation coefficients between the gene expression levels and tumour purity within each PAM50 subtype in differently normalized data.
7.1.3 DE analysis between sample with low and high tumour purity
Further, we evaluate the effects of tumor purity variation on the data using differential expression (DE) analyses between sample with low and high tumor purity. DE analyses were performed using the Wilcoxon signed-rank test with log2 transformed of the raw counts and differently normalized datasets. In the absence of tumor purity variation, the histogram of the resulting un-adjusted p-values should be uniformly distributed(Figure 7.4).
de.purity.pam50Genefu <- lapply(
pam50.levels[1:4],
function(x){
pval <- lapply(
normalizations[c(3,4)],
function(y){
de.purity.pam50Genefu.all[[x]][[y]]$pvalue
})
pval <- as.data.frame(do.call(cbind, pval))
colnames(pval) <- c(
'FPKM.UQ',
'RUV-III')
pval <- pval %>%
tidyr::pivot_longer(
everything(),
names_to = 'Datasets',
values_to = 'Pvalues') %>%
dplyr::mutate(
Datasets = factor(
Datasets,
levels = c(
'FPKM.UQ',
'RUV-III'))) %>%
data.frame()
ggplot(pval, aes(x = Pvalues, fill = Datasets)) +
geom_histogram(binwidth = 0.1) +
scale_x_continuous(breaks = c(seq(0,1,.5))) +
xlab('p_values') +
ylab('Frequency') +
scale_fill_manual(values = dataSets.colors[3:4]) +
ggtitle(x) +
# facet_wrap(~Datasets, ncol = 4) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 18),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14))
})
names(de.purity.pam50Genefu) <- pam50.levels[1:4]
do.call(
ggpubr::ggarrange,
c(
de.purity.pam50Genefu,
ncol = 4,
common.legend = TRUE,
legend="right"))Figure 7.4: p-value histograms of differential expression analysis between samples with low and high tumor purity within the four main PAM50 subtypes in the FPKM.UQ and the RUV-III normalized datasets.
7.1.4 Gene co-expresseion analyses
Figure 7.5 shows that the gene expression levels of ZEB2 and ETS1 are both highly correlated with tumor purity. The ZEB2 gene is a one of the regulators of the epithelial‐mesenchymal transition process that induces invasion of cancer cells. ETS1 is member of a large family of transcription factors characterized by their ETS DNA‐ binding domain. The gene appears to have dichotomous roles as an oncogene and a tumor suppressor gene in different cancer types. The high correlation of the ETS1 with the ZEB2 in the TCGA BRCA RNA-seq data may confirm its oncogene role, but this is most likely a consequence of their correlations with tumor purity. The RUV-III normalized data and the breast cancer laser microdissection microarray data showed that the expression levels of these two genes are uncorrelated (Figure 7.5).
selected.genes <- c('ZEB2', 'ETS1')
expr.genes <- lapply(
normalizations[c(3:4)],
function(x){
t(SummarizedExperiment::assay(
brca.cancer.se[selected.genes,],
x))
})
expr.genes <- as.data.frame(do.call(cbind, expr.genes))
colnames(expr.genes) <- paste0(
colnames(expr.genes),
rep(c('tcga', 'ruv'), each = 2)
)
expr.genes$purity <- brca.cancer.se$purity_HTseq_FPKM.UQ
conditions <-
c('ZEB2tcga',
'purity',
'ETS1tcga',
'purity',
'ZEB2tcga',
'ETS1tcga',
'ZEB2ruv',
'ETS1ruv')
lables.cond <-
c('Gene expression (ZEB2)',
'Tumour purity score',
'Gene expression (ETS1)',
'Tumour purity score',
'Gene expression (ZEB2)',
'Gene expression (ETS1)',
'Gene expression (ZEB2)',
'Gene expression (ETS1)')
pp <- lapply(
c(1,3,5,7),
function(x){
ggplot(
expr.genes, aes(x = expr.genes[, conditions[x]], y = expr.genes[, conditions[x + 1]])) +
geom_point(aes(colour = purity)) +
scale_colour_gradientn(
colours = viridis::viridis(n =10),
name = 'Tumour purity') +
xlab(lables.cond[x]) +
ylab(lables.cond[x + 1]) +
ggpubr::stat_cor(
aes(label = ..r.label..),
label.x.npc = .4,
label.y.npc = 1,
hjust = 0,
size = 7,
r.accuracy = .1,
col = 'black',
cor.coef.name = "rho"
) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1.2),
legend.position = "bottom",
axis.text.x = element_text(size = 8),
legend.key.width = unit(.5, "cm"),
legend.key.height = unit(.2, "cm"),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
legend.text = element_text(size = 8),
legend.title = element_text(size = 12)) +
guides(colour = guide_colourbar(title.position = "top"))})
### Microarray
df.micro.78958 <- data.frame(
ZEB2 = expr.78958[ '9839_ZEB2' , ],
ETS1 = expr.78958[ '2113_ETS1' , ]
)
p.micro.78958 <- ggplot(df.micro.78958, aes(x = ZEB2, y = ETS1)) +
geom_point() +
scale_colour_gradientn(
colours = viridis::viridis(n =10),
name = 'Tumour purity') +
xlab('Gene expression (ZEB2)') +
ylab('Gene expression (ETS1)') +
ggpubr::stat_cor(
aes(label = ..r.label..),
label.x.npc = .4,
label.y.npc = 1,
hjust = 0,
r.accuracy = .1,
size = 7,
col = 'black',
cor.coef.name = "rho") +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1.2),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
axis.title.x = element_text(size = 10),
axis.title.y = element_text(size = 10)
)
pp[[5]] <- p.micro.78958
gridExtra::grid.arrange(
pp[[1]],
pp[[2]],
pp[[3]],
pp[[4]],
pp[[5]],
layout_matrix = matrix(c(1, 2, 3, 4, 5, 'NA'), 3, 2, byrow = T))Figure 7.5: First row: relationship between tumor purity scores and the ZEB2 and ETS1 gene expression in the FPKM data. Second row: Scatter plots exhibit relationship between the ZEB2 and ETS1 gene expression in the FPKM data (left) and the RUV-III normalized data (right). Third row: Scatter plot shows the relationship between the ZEB2 and ETS1 gene expression in the laser capture microdissection microarray data.
7.1.5 Partial pairwise correlation analysis
To extend this observation, we selected 1300 genes whose gene expression levels are highly correlated with tumor purity and then calculated Spearman correlation between all possible pairs of these genes. In a matching analysis, we computed partial correlations between these pairs adjusting for tumor purity. Figure 7.6 shows that there are many gene pairs that have high correlations, but these are mostly likely a consequence of their correlation with tumor purity.
### All
all.corr.data <- data.frame(
ppcor.tcga = corr.genePurity.partialPairWise.all$HTseq_FPKM.UQ,
spearman.tcga = corr.genePurity.pairWise.all$HTseq_FPKM.UQ,
ppcor.ruv = corr.genePurity.pairWise.all$RUV_III,
spearman.ruv = corr.genePurity.partialPairWise.all$RUV_III
)
p1 <- ggplot(all.corr.data, aes(x = ppcor.tcga, y = spearman.tcga)) +
geom_hex() +
geom_abline(slope = 1, intercept = 0) +
ylim(-.8,1) +
xlim(-.8,1) +
xlab('Partial correlation') +
ylab('Correlation') +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 8),
legend.title = element_text(size = 8)
)
p2 <- ggplot(all.corr.data, aes(x = ppcor.ruv, y = spearman.ruv)) +
geom_hex() +
geom_abline(slope = 1, intercept = 0) +
ylim(-.8,1) +
xlim(-.8,1) +
xlab('Partial correlation') +
ylab('Correlation') +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 8),
legend.title = element_text(size = 8)
)
gridExtra::grid.arrange(p1, p2, ncol = 2)Figure 7.6: Scatter plots show the Spearman correlation coefficients and partial correlation coefficients for all possible pairs of the genes that have the 1300 highest correlations with tumor purity in the TCGA FPKM.UQ (left) and RUV-III normalized data (right).
7.1.6 Gene expression and survival analysis
Variation in tumor purity can also affect the association between gene expression levels and survival outcomes. For example, the expression of the ZEB2 gene shows to be associated with cancer progression and survival outcome in different cancer types. The RUV-III normalization revealed that high expression of the ZEB2 gene is associated with a poor outcome in the TCGA BRCA RNA-seq data, but this was obscured by variation in tumor purity in the FPKM.UQ normalized data (figure 7.7).
Another example is the Stabilin 1 (STAB1) gene, whose expression levels are associated with survival in several cancer types, including breast cancer. However, this association was only evident in the present data after removing variation in tumor purity. We found many more examples of such genes using the RUV-III normalized data (Figure 7.7)).
selected.genes <- c(
'TGFBR2',
'ZEB2',
'STAB1',
'ESRRA')
pp.tcga <- lapply(
selected.genes,
function(x){
p.survival.purity.tcga <- survival_plot(
data = SummarizedExperiment::assay(brca.cancer.se, 'HTseq_FPKM.UQ'),
stratify = 'expr',
annot = SummarizedExperiment::colData(brca.cancer.se),
scoreCol = NULL,
gene = x,
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
mainTitle1 = x,
confInt = FALSE,
ylabel = "Survival",
cols = c(brewer.pal(9, "Set1"))[c(2, 3)],
nColLegend = 1,
plotType = "autoplot"
)$plot +
surv.theme
})
pp.ruv <- lapply(
selected.genes,
function(x){
p.survival.purity.tcga <- survival_plot(
data = SummarizedExperiment::assay(brca.cancer.se, 'RUV_III'),
stratify = 'expr',
annot = SummarizedExperiment::colData(brca.cancer.se),
scoreCol = NULL,
gene = x,
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
mainTitle1 = x,
confInt = FALSE,
ylabel = "Survival",
cols = c(brewer.pal(9, "Set1"))[c(2, 3)],
nColLegend = 1,
plotType = "autoplot"
)$plot +
surv.theme
})
gridExtra::grid.arrange(
pp.tcga[[1]],
pp.ruv[[1]],
pp.tcga[[2]],
pp.ruv[[2]],
pp.tcga[[3]],
pp.ruv[[3]],
pp.tcga[[4]],
pp.ruv[[4]],
ncol = 2)Figure 7.7: Kaplan Meier survival analysis shows the association between several genes expression and overall survival in the FPKM.UQ (left) and the RUV-III normalized data (right).
7.1.7 Tumour purity estimates
Figure 7.8 shows the distributions of tumor purity scores in the FPKM.UQ and RUV-III normalized datasets.
purity.score <- data.frame(
FPKM.UQ = purity.estimates$FPKM.UQ,
"RUV-III" = purity.estimates$`RUV-III`
) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'purity') %>%
data.frame()
ggplot(purity.score, aes( x = purity, fill = datasets)) +
geom_histogram(alpha = 0.7, position = "identity") +
scale_fill_manual(values = dataSets.colors[c(3,4)], name = 'Datasets') +
xlab('Tumour purity scores') +
ylab('Frequency') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 20),
axis.title.y = element_text(size = 20),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))Figure 7.8: Distributions of tumor purity scores in the FPKM.UQ and RUV-III normalized datasets.
7.2 Flow cell chemistry effects
7.2.1 PCA plots
As we mentioned above, the TCGA BRCA RNA-seq samples were profiled over two batches of flow cell chemistries. PCA plots of the FPKM and FPKM.UQ normalized datasets showed noticeable variation due to the use of two flow cell chemistries (Figure 7.9), whereas RUV-III effectively removed this variation from the data (Figure 7.9).
plot.names <- c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III'
)
pp <- lapply(
c(1:4),
function(x){
pcs <- pca.all[[x]]
p <- .scatter.density.pc(
pcs = pcs$sing.val$u[,1:3],
pc.var = pcs$var,
group.name = 'Flow cell chemistry',
group = brca.cancer.se$FcCh,
color = FcCh.colors,
strokeSize = .2,
pointSize = 2,
strokeColor = 'gray30',
alpha = .6,
title = plot.names[x])
p
})
do.call(
gridExtra::grid.arrange,
c(pp[[1]],
pp[[2]],
pp[[3]],
pp[[4]],
ncol = 4))Figure 7.9: The first three PC coloured by flow cell chemistry in the TCGA BRCA raw counts, FPKM, FPKM.UQ and RUV-III normalized data.
7.2.2 Vector correlation analysis
Figure 7.10 shows the vector correlation analysis between the flow cell chemistry batches and the first 10 principal components for different normalization methods. Ideally, we should see no significant association between the PCs and the batches.
pcs.fcch.cca <- as.data.frame(cca.fcch.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'cca.coef') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III')))
## plot
ggplot(pcs.fcch.cca, aes(x = pcs, y = cca.coef, group = datasets)) +
geom_line(aes(color = datasets)) +
geom_point(aes(color = datasets)) +
xlab('') + ylab ("Vector correlation") +
scale_color_manual(values = c(dataSets.colors), name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
axis.text.x = element_text(size = 12, angle = 35, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14))Figure 7.10: A plot showing the vector correlation coefficient between the flow cell chemistry and up to the first 10 principal components.
Future 7.11 shows the vector correlation analysis between the flow cell chemistry batches and the first 10 principal components within each PAM50 subtypes for different normalization methods. These plots (7.11) clearly show that the RUV-III normalization removes the effects of using two flow cell chemistries in the TCGA BRCA RNA-Seq data.
pcs.fcch.cca.pam50 <- lapply(
pam50.levels,
function(x){
r.seq <- lapply(
normalizations,
function(y){
cca.fcch.pam50Genefu.all[[x]][[y]]
})
names(r.seq) <- paste(normalizations, x, sep = '__')
do.call(cbind, r.seq)
})
names(pcs.fcch.cca.pam50) <- pam50.levels
pcs.fcch.cca.pam50 <- do.call(cbind, pcs.fcch.cca.pam50) %>%
data.frame(.) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-c(pcs),
names_to = 'datasets',
values_to = 'cca.coef') %>%
tidyr::separate(
col = datasets,
sep ='__',
into = c('datasets', 'pam50' )) %>%
data.frame(.) %>%
dplyr::mutate(
type = recode(
datasets,
'HTseq_counts' = 'Raw counts',
'HTseq_FPKM' = 'FPKM',
'HTseq_FPKM.UQ' ='FPKM.UQ',
'RUV_III' = 'RUV-III'
))
### plot
ggplot(pcs.fcch.cca.pam50 , aes(x = pcs, y = cca.coef)) +
geom_line(aes(color = type)) +
geom_point(aes(color = type)) +
xlab('PCs') +
ylab ('Vector correlation') +
scale_color_manual(values = dataSets.colors, name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
facet_wrap(~pam50, ncol = 3) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10),
legend.position = 'bottom',
strip.text = element_text(size = 14)) +
guides(col = guide_legend(nrow = 2))Figure 7.11: A plot showing the vector correlation coefficient between flow cell chemistery and up to the first 10 principal components within each PAM50 subtype.
7.2.3 ANOVA
Figure 7.12 shows log2 F-statistics obtained from ANOVA analysis. The RUV-III normalization reduced the number of genes that are highly affected by the flow cell chemistry batches.
ftest.fcch <- lapply(
normalizations,
function(x) ftest.fcch.all[[x]]$FValue) %>%
do.call(
cbind, . )
colnames(ftest.fcch) <- c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III'
)
ftest.fcch <- ftest.fcch %>%
data.frame(.) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datastes',
values_to = 'fval'
)
ggplot(ftest.fcch, aes(x = datastes, y = log2(fval))) +
Ipaper::geom_boxplot2() +
ylab(expression(Log[2]~'F statistics')) +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black'),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14))Figure 7.12: Boxplots of log2 F statistics obtained from ANOVA for gene expression with flow cell chemistry batche as a factor.
7.2.4 Silhouette coefficient and ARI index analyses
Figure 7.13 shows that the RUV-III normalization outperforms the other normalization in mixing samples from the two flow cell chemistry batches.
names(silCoef.fcch.all) <- normalizations.names
silCoef.fcch <- as.data.frame(silCoef.fcch.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'silhou.coff')
p.sil.fcch <- ggplot(silCoef.fcch, aes(x = datasets, y = silhou.coff)) +
geom_col() +
ylab('Silhouette coefficient') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12))
ari.fcch <- as.data.frame(ari.fcch.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'ari')
p.ari.fcch <- ggplot(ari.fcch, aes(x = datasets, y = ari)) +
geom_col() +
ylab('ARI') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12))
gridExtra::grid.arrange(
p.sil.fcch,
p.ari.fcch,
ncol = 2)Figure 7.13: Silhouette coefficients and ARI index for exhibiting the mixing of samples from two flow cell chemistry batches.
7.2.5 Expression heatmap of affected genes by flow cell chemistry
An expression heatmap of genes most highly affected genes by the flow cell chemistries shows that different genes are affected in different ways (Figure 7.14).
Interestingly, the heatmap also reveals two clusters within the samples processed by the first flow cell chemistry. This suggests that there are additional sources of unwanted variation of unknown origin within each flow cell chemistry.
selected.genes <- ftest.fcch.all$HTseq_FPKM.UQ$FValue > 300
selected.genes <- ftest.fcch.all$HTseq_FPKM.UQ$Genes[selected.genes]
column_ha = ComplexHeatmap::HeatmapAnnotation(
'Flow cell chemistry' = brca.cancer.se$FcCh,
col = list(
'Flow cell chemistry' = FcCh.colors
),
annotation_name_gp = grid::gpar(fontsize = 14),
annotation_legend_param = list(
title_gp = grid::gpar(fontsize = 14)
))
h.data <- SummarizedExperiment::assay(brca.cancer.se[selected.genes , ], 'HTseq_FPKM.UQ')
h.fcch <- ComplexHeatmap::Heatmap(
matrix = t(scale(t(h.data), scale = T, center = T)),
top_annotation = column_ha,
cluster_rows = TRUE,
cluster_columns = FALSE,
show_row_names = FALSE,
show_column_names = FALSE,
col = RColorBrewer::brewer.pal(
n = 11, name = 'BrBG')[c(1:2,6,10:11)],
heatmap_legend_param = list(
at = c(seq(-6,6,3)),
title = 'Expression',
title_gp = grid::gpar(fontsize = 14)))
ComplexHeatmap::draw(
h.fcch,
heatmap_legend_side = "right",
annotation_legend_side = "right",
merge_legend = TRUE)Figure 7.14: Gene expression heatmap of the 367 genes most highly affected by the flow cell chemistry change in the TCGA FPKM.UQ data (rows are genes expression, columns are samples in chronological order of sample processing.
To explore this more fully, we take the set of genes most highly affected by the flow cell chemistries and scored samples against this gene set (hereafter called the batch score) using the R/Bioconductor package singscore on the FPKM.UQ normalized dataset (Figure 7.15). Batch scores clearly distinguished samples from the flow cell chemistry batches and separated the samples into clusters within each flow cell chemistry. We then use arbitrary cut-offs to divide the samples into 4 groups based on their batch scores(Figure 7.15). These groups were not visible in the batch scores obtained from the RUV-III normalized data (Figure 7.15).
batch.scores <- data.frame(
FPKM.UQ = fcch.scores$HTseq_FPKM.UQ$TotalScore,
RUV.III = fcch.scores$RUV_III$TotalScore,
samples = c(1:1086),
batch = as.factor(brca.cancer.se$sub.batches)
) %>%
tidyr::pivot_longer(
-c(samples, batch),
values_to = 'scores',
names_to = 'data.sets') %>%
dplyr::mutate(data.sets = case_when(
data.sets == 'RUV.III' ~ 'RUV-III',
data.sets != 'RUV.III' ~ 'FPKM.UQ',
)) %>%
data.frame(.)
ggplot(batch.scores, aes(x = samples, y = scores, color = batch )) +
geom_point() +
scale_color_manual(values = sub.batches.color[1:4], name = 'Batch') +
ylab('Batch scores') +
xlab('Samples') +
ylim(c(-.1, .26)) +
facet_wrap(~data.sets) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1.2),
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
legend.text = element_text(size = 12),
legend.title = element_text(size = 12),
legend.key = element_blank(),
strip.text.x = element_text(size = 16)
) +
guides(color = guide_legend(override.aes = list(size = 3)))Figure 7.15: Batch scores across samples in the FPKM.UQ (left) and RUV-III (right) normalized datasets. The batch scores were calculated by the singscore method using the 143 genes described above. Samples were divided into 4 groups based on their batch scores.
7.2.6 Spurious correlation between genes
The complex unwanted variation arising from the change in flow cell chemistry and the unknown source noted above clearly compromises estimates of gene co-expression in the FPKM.UQ normalized dataset. It introduces correlations between pairs of genes that are most likely not correlated. For example, the expression levels of the Estrogen Related Receptor Alpha (ESRRA) and Mitogen-Activated Protein Kinase Kinase Kinase 2 (MAP3K2) genes are positively correlated in this dataset (7.16), however this correlation seems to be a consequence of the unwanted variation in the data (7.16), for we do not see it in either the RUV-III normalized data or the TCGA BRCA microarray data (7.16).
selected.genes <- c('ESRRA', 'MAP3K2')
expr.genes <- lapply(
normalizations[c(3:4)],
function(x){
t(SummarizedExperiment::assay(
brca.cancer.se[selected.genes,],
x))
})
expr.genes <- as.data.frame(do.call(cbind, expr.genes))
colnames(expr.genes) <- paste0(
colnames(expr.genes),
rep(c('tcga', 'ruv'), each = 2)
)
expr.genes$sub.batches <- brca.cancer.se$sub.batches
expr.genes$batch.score <- fcch.scores$HTseq_FPKM.UQ$TotalScore
conditions <-
c('ESRRAtcga',
'batch.score',
'MAP3K2tcga',
'batch.score',
'ESRRAtcga',
'MAP3K2tcga' ,
'ESRRAruv',
'MAP3K2ruv')
lables.cond <-
c('ESRRA (gene expression)',
'Batch scores',
'MAP3K2 (gene expression)',
'Batch scores',
'ESRRA (gene expression)',
'MAP3K2 (gene expression)',
'ESRRA (gene expression)',
'MAP3K2 (gene expression)')
pp <- lapply(
c(1,3,5,7),
function(x){
ggplot(expr.genes, aes(x = expr.genes[,conditions[x]], y = expr.genes[,conditions[x +1]])) +
geom_point(
aes(fill = sub.batches),
pch = 21,
color = 'gray30',
stroke = .2,
size = 1.8
) +
scale_fill_manual(values = sub.batches.color) +
xlab(lables.cond[x]) +
ylab(lables.cond[x+1]) +
ggpubr::stat_cor(
aes(label = ..r.label..),
label.x.npc = 0.4,
label.y.npc = .98,
r.digits = 2,
p.digits = 2,
r.accuracy = 0.1,
hjust = 0,
size = 6,
col = 'black',
cor.coef.name = "rho") +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
legend.position = "none",
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 12),
legend.key = element_blank(),
strip.text.x = element_text(size = 12))
})
### micro array
expr.genes.array <- as.data.frame(t(brca.microarray[ selected.genes, ]))
p.array <- ggplot(expr.genes.array, aes(x = ESRRA, y = MAP3K2)) +
geom_point(
color = 'gray30',
stroke = .2,
size = 1.8
) +
xlab('ESRRA (gene expression)') +
ylab('MAP3K2 (gene expression)') +
ggpubr::stat_cor(
aes(label = ..r.label..),
label.x.npc = 0.4,
label.y.npc = .98,
r.digits = 2,
p.digits = 2,
r.accuracy = 0.1,
hjust = 0,
size = 6,
col = 'black',
cor.coef.name = "rho") +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
legend.position = "none",
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
plot.title = element_text(size = 12),
legend.key = element_blank(),
strip.text.x = element_text(size = 12))
pp[[5]] <- p.array
gridExtra::grid.arrange(
pp[[1]],
pp[[2]],
pp[[3]],
pp[[4]],
pp[[5]],
layout_matrix = matrix(c(1, 2, NA, 3, 4, 5), 2, 3, byrow = T))Figure 7.16: First row: Relationship between the ESSRA and MAP3K2 gene expression with the batch scores in the FPKM.UQ data. The second row: Scatter plots show the relationship between the ESSRA and MAP3K2 gene expression in the FPKM.UQ (left), the RUV-III normalized data (middle) and the TCGA BRCA microarray gene expression data (right).
7.2.7 Global co-expression analysis of highly affected genes
To extend this analysis, we first selected the genes that had the ~ 1000 highest correlations with the batch scores in the FPKM.UQ normalized data and calculated all gene-gene correlations between them in both the FPKN.UQ and RUV-III normalized datasets. Figure 7.17 shows that a large number of gene pairs have high correlations in the FPKM.UQ normalized data, something we do not see in the RUV-III normalized data.
corr.fcchGenes <- data.frame(
tcga = corr.geneBatchScore.pairWise.all$HTseq_FPKM.UQ,
ruv = corr.geneBatchScore.pairWise.all$RUV_III
)
ggplot(corr.fcchGenes, aes(x = tcga, y = ruv)) +
geom_hex() +
geom_abline(slope = 1, intercept = 0) +
ylim(-.5,.8) +
xlim(-.5,.8) +
xlab("Spearman correlation (TCGA-FPKM.UQ)") +
ylab("Spearman correlation (RUV-III)") +
geom_hline(yintercept = 0, linetype = 2) +
geom_vline(xintercept = 0, linetype = 2) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black'),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 14),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14))Figure 7.17: Scatter plots display Spearman correlation coefficients of all possible pairs of genes that highly affected by flow cell chemistries in the FPKM.UQ and the RUV-III normalized data.
7.2.8 Novel correlation between genes
Interestingly, the overall correlation between expression of the E2F Transcription Factor 4 (E2F4) and CCR4-NOT Transcription Complex Subunit 1 (CNOT1) genes is ρ = 0.1, while the average of the correlations of these genes within each of groups 1 to 4 of the unknown source of unwanted variation is ρ = 0.4 (Figure 7.18) in the FPKM.UQ normalized data. Both the RUV-III normalized and the TCGA microarray data show a high positive correlation between the expression levels of the E2F4 and CNOT1 genes.
gene.x <- 'CNOT1'
gene.y <- 'E2F4'
expr.genes <- lapply(
c(1,3,4),
function(x){
if(x == 1){
data.frame(
gene.x = brca.microarray[gene.x , ],
gene.y = brca.microarray[gene.y , ],
sub.batches = rep('group5', ncol(brca.microarray)),
dataset = rep('Microarray', ncol(brca.microarray))
)
}else{
data.frame(
gene.x = SummarizedExperiment::assay(
brca.cancer.se,
normalizations[x])[gene.x , ],
gene.y = SummarizedExperiment::assay(
brca.cancer.se,
normalizations[x])[gene.y , ],
sub.batches = brca.cancer.se$sub.batches,
dataset = rep(normalizations.names[x], ncol(brca.cancer.se))
)}
})
expr.genes <- as.data.frame(
do.call(rbind, expr.genes)
)
expr.genes$dataset <- factor(
x = expr.genes$dataset,
levels = c(
'FPKM.UQ',
'RUV-III',
'Microarray'))
ggplot(expr.genes, aes(x = gene.x, y = gene.y, color = sub.batches)) +
geom_point(aes(fill = sub.batches), pch = 21, stroke = .2) +
scale_fill_manual(values = sub.batches.color) +
geom_smooth(method = "lm", se = FALSE) +
xlab(paste0(gene.x, ' (gene expression)')) +
ylab(paste0(gene.y, ' (gene expression)')) +
facet_wrap(~ dataset, scale = 'free') +
ggpubr::stat_cor(
aes(color = sub.batches, label = ..r.label..),
method = "spearman" ,
label.x.npc = 0.02,
label.y.npc = .9,
hjust = 0,
size = 4,
r.accuracy = 0.1,
cor.coef.name = "rho"
) +
ggpubr::stat_cor(
aes( label = ..r.label..),
label.x.npc = 0.4,
method = "spearman",
label.y.npc = .98,
r.digits = 2,
p.digits = 2,
r.accuracy = 0.1,
hjust = 0,
size = 6,
col = 'black',
cor.coef.name = "rho"
) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
legend.position = "none",
axis.text.x = element_text(size = 10),
axis.text.y = element_text(size = 10),
axis.title.x = element_text(size = 12),
axis.title.y = element_text(size = 12),
strip.text.x = element_text(size = 18))Figure 7.18: Scatter plots exhibit the relationship between the E2F4 and CNOT1 gene expression in the FPKM.UQ (left), the RUV-III normalized data (middle) and the TCGA BRCA microarray data (right).
7.2.9 Gene expression differences paired metastatic-primary samples
Figure 7.19 shows the impact of flow cell chemistries on gene expression differences between paired primary and metastatic samples in the TCGA BRCA RNA-seq data. MA plot of three paired primary and metastatic samples profiled within (the first two plots of top row) and across flow cell chemistries (third plot of top row) in the TCGA FPKM.UQ data. Second row: same a first row, for the RUV-III normalized data. The solid black points in the MA plots represent the genes highly affected by flow cell chemistries.
Figure 7.19: MA plot of three paired primary and metastatic samples profiled within (the first two plots of top row) and across flow cell chemistries (third plot of top row) in the TCGA FPKM.UQ data. Second row: same a first row, for the RUV-III normalized data. The solid black points in the MA plots represent the genes highly affected by flow cell chemistries.
## [[1]]
## NULL
##
## [[2]]
## NULL
7.3 Library size effects
7.3.1 Association between PCs and library size
Figure 7.20 shows the association between the first 10 PCs and library size in the raw counts and differently normalized data.
pcs.ls.lnreg <- as.data.frame(lreg.ls.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'r.sq') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III')))
### Plots
ggplot(pcs.ls.lnreg, aes(x = pcs, y = r.sq, group = datasets)) +
geom_line(aes(color = datasets), size = 1) +
geom_point(aes(color = datasets), size = 3) +
xlab('PCs') +
ylab (expression("R"^"2")) +
scale_color_manual(
values = dataSets.colors,
labels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III'),
name = 'Datasets') +
scale_x_continuous(
breaks = (1:10),
labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(
breaks = scales::pretty_breaks(n = 5),
limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14))Figure 7.20: A plot showing the R-squared of linear regression between library size and up to the first 10 principal components (taken cumulatively) for different normalization methods.
Figure 7.20 shows the association between the first 10 PCs and library size in the raw counts and differently normalized data.
pam50.levels <- levels(
brca.cancer.se$pam50.geneFu.fpkmUq
)
pcs.ls.lnreg.pam50 <- lapply(
pam50.levels,
function(x){
r.seq <- lapply(
normalizations,
function(y){
lreg.ls.pam50Genefu.all[[x]][[y]]
})
names(r.seq) <- paste(normalizations, x, sep = '__')
do.call(cbind, r.seq)
})
names(pcs.ls.lnreg.pam50) <- pam50.levels
pcs.ls.lnreg.pam50 <- do.call(cbind, pcs.ls.lnreg.pam50) %>%
data.frame(.) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-c(pcs),
names_to = 'datasets',
values_to = 'r.sq') %>%
tidyr::separate(
col = datasets,
sep ='__',
into = c('datasets', 'pam50' )) %>%
data.frame(.) %>%
dplyr::mutate(
type = recode(
datasets,
'HTseq_counts' = 'Raw counts',
'HTseq_FPKM' = 'FPKM',
'HTseq_FPKM.UQ' ='FPKM.UQ',
'RUV_III' = 'RUV-III'))
### plots
ggplot(pcs.ls.lnreg.pam50 , aes(x = pcs, y = r.sq)) +
geom_line(aes(color = type)) +
geom_point(aes(color = type)) +
xlab('PCs') +
ylab (expression('R'^'2')) +
scale_color_manual(values = dataSets.colors, name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
facet_wrap(~pam50, ncol = 3) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10),
strip.text = element_text(size = 14))Figure 7.21: A plot showing the R-squared of linear regression between library size and up to the first 10 principal components (taken cumulatively) within each PAM50 subtypes for different normalization methods.
7.3.2 Correlation between genes and library size
Figure 7.22 shows the correlations between individual gene expression and library size in the raw counts and differently normalized data.
corr.geneLs.pam50 <- lapply(
pam50.levels,
function(x){
r.seq <- lapply(
c(1:4),
function(y){
corr.geneLs.pam50Genefu[[x]][[y]]$ls_rho
})
names(r.seq) <- paste(normalizations, x, sep = '__')
do.call(cbind, r.seq)
})
names(corr.geneLs.pam50) <- pam50.levels
corr.geneLs.pam50 <- do.call(cbind, corr.geneLs.pam50) %>%
data.frame(.) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'corr.coef') %>%
tidyr::separate(
col = datasets,
sep ='__',
into = c('datasets', 'pam50' )) %>%
data.frame(.) %>%
dplyr::mutate(type = recode(
datasets,
'HTseq_counts' = 'Raw counts',
'HTseq_FPKM' = 'FPKM',
'HTseq_FPKM.UQ' ='FPKM.UQ',
'RUV_III' = 'RUV-III')) %>%
dplyr::mutate( type = factor(
type,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III'))) %>%
data.frame(.)
ggplot(corr.geneLs.pam50, aes(x = pam50, y = corr.coef, fill = type)) +
geom_boxplot(outlier.color = 'white') +
ylab("Spearman correlation") +
xlab('PAM50 subtypes') +
scale_fill_manual(values = dataSets.colors, name = 'Datasets') +
guides(fill = guide_legend("Datasets")) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1.3),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 18),
axis.text.y = element_text(size = 18),
legend.text = element_text(size = 14),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))Figure 7.22: The boxplots of the Spearman correlation coefficients between the individual gene expression levels and library size within the PAM50 subtypes in differently normalized datasets
7.3.3 DE analysis between sample with low and high library size
Further, we evaluate the effects of library differences on the data using differential expression (DE) analyses between sample with low and high library size (2010 vs. 2011:2014). DE analyses were performed using the Wilcoxon signed-rank test with log2 transformed of the raw counts and normalized datasets. In the absence of any library size effects, the histogram of the resulting unadjusted p-values should be uniformly distributed (Figure 7.23).
de.ls.pam50Genefu <- lapply(
pam50.levels[1:4],
function(x){
pval <- lapply(
normalizations,
function(y){
de.ls.pam50Genefu.all[[x]][[y]]$pvalue
})
pval <- as.data.frame(do.call(cbind, pval))
colnames(pval) <- c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III')
pval <- pval %>%
tidyr::pivot_longer(
everything(),
names_to = 'Datasets',
values_to = 'Pvalues') %>%
dplyr::mutate(
Datasets = factor(
Datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III'))) %>%
data.frame()
ggplot(pval, aes(x = Pvalues)) +
geom_histogram(binwidth = 0.1) +
scale_x_continuous(breaks = c(seq(0,1,.5))) +
xlab('p_values') + ylab('Frequency') +
ggtitle(x) +
facet_wrap(~Datasets, ncol = 4) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
plot.title = element_text(size = 18),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))
})
names(de.ls.pam50Genefu) <- pam50.levels[1:4]
do.call(
gridExtra::grid.arrange,
de.ls.pam50Genefu)Figure 7.23: p-value histograms of differential expression analysis between samples with low and high library size within the four main PAM50 subtypes in the FPKM.UQ and the RUV-III normalized datasets.
7.4 PAM50 subtypes
Here, we evaluate the performance of different normalization methods in separating the PAM50 clusters.
7.4.1 PCA plots
Figure 7.24 shows PCA plots of the TCGA raw counts and different normalized dataset coloured by the PAM50 subtypes.
genefu.calls <- c(
'pam50.geneFu.raw',
'pam50.geneFu.fpkm',
'pam50.geneFu.fpkmUq',
'pam50Genefu.ruv')
pp <- lapply(
c(1:4),
function(x){
pcs <- pca.all[[x]]
p <- .scatter.density.pc(
pcs = pcs$sing.val$u[,1:3],
pc.var = pcs$var,
group.name = 'PAM50',
group = SummarizedExperiment::colData(brca.cancer.se)[ , genefu.calls[x]],
color = pam50.colors[2:6],
strokeSize = .2,
pointSize = 2,
strokeColor = 'gray30',
alpha = .6,
title = plot.names[x])
p
})
do.call(
gridExtra::grid.arrange,
c(pp[[1]],
pp[[2]],
pp[[3]],
pp[[4]],
ncol = 4))Figure 7.24: PCA plots of differently normalized data of the TCGA BRCA RNA-seq data coloured by PAM50 subtypes.
7.4.2 Vector correlation
Figure 7.25 shows the vector correlation coefficient between the PAM50 subtypes and the first 10 principal components for different normalization methods. Ideally, we should see high association between the PCs and the the PAM50 subtypes.
cca.pam50Genefu <- as.data.frame(cca.pam50Genefu.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'cca.corr') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III')))
### plot
ggplot(cca.pam50Genefu, aes(x = pcs, y = cca.corr, group = datasets)) +
geom_line(aes(color = datasets)) +
geom_point(aes(color = datasets)) +
xlab('PCs') + ylab ("Vector correlation") +
scale_color_manual(values = c(dataSets.colors), name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 14),
legend.title = element_text(size = 14))Figure 7.25: A plot showing the vector correlation coefficient between the PAM50 subtypes and up to the first 10 principal components
7.4.3 Silhouette coefficient and ARI index analyses
Figure 7.26 shows that the RUV-III normalization outperforms the other normalization in separating the PAM50 subtypes.
names(silCoef.pam50Genefu.all) <- normalizations.names
silCoef.pam50 <- as.data.frame(silCoef.pam50Genefu.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'silhou.coff' ) %>%
dplyr::mutate(datasets = factor(
normalizations.names,
levels = normalizations.names))
p.sil.pam50 <- ggplot(silCoef.pam50, aes(x = datasets, y = silhou.coff)) +
geom_col() +
ylab('Silhouette coefficient') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12))
names(ari.pam50Genefu.all) <- normalizations.names
ari.pam50 <- as.data.frame(ari.pam50Genefu.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'ari') %>%
dplyr::mutate(datasets = factor(
normalizations.names,
levels = normalizations.names))
p.ari.pam50 <- ggplot(ari.pam50, aes(x = datasets, y = ari)) +
geom_col() +
ylab('ARI') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12))
gridExtra::grid.arrange(
p.sil.pam50,
p.ari.pam50,
ncol = 2)Figure 7.26: Silhouette coefficients and ARI index for exhibiting the separation of the PAM50 subtypes in different datasets
7.4.4 Survival analysis of different PAM50 calls
The initial PAM50 subtypes and the ER estimates of the TCGA BRCA RNA-Seq data was kindly provided by Dr. K A. Hoadley. As we showed abovem, we implemented an algorithm proposed by Picornell, A.C (BMC Genomics, 2019) to recapitulate the process of obtaining the initial PAM50 subtypes. We used the ER estimates = 1.4 to divide samples to ER positive and negative groups and the calculate the calibration (median normalization) factors. This process was applied on the differently normalized TCGA BRCA RNA-seq data. In addition, we have also used molecular.subtyping() function with the PAM50 (single sample predictor) model from the genefu Rpackage to identify the PAM50 subtypes. This method performs Spearman correlation between each sample (only the PAM50 genes) and a PAM50 centroids (this data were downloaded here: https://github.com/bhklab/genefu) to calculate correlation coefficient for individual PAM50 subtypes. Each sample is assigned to a particular PAM50 subtype based its highest correlation coefficients. Here, we use Kaplan Meier survival analysis to assess the prognostic values of different PAM50 identification approaches. The results showed that the PAM50 subtypes obtained by the genefu method is slightly more prognostic compared to the other method (Figure 7.27.
cols <- c(
'Sample',
'age_at_initial_pathologic_diagnosis_liu',
'OS.time_liu',
'OS_liu',
'pam50.geneFu.raw',
'pam50.geneFu.fpkm',
'pam50.geneFu.fpkmUq',
'pam50.erBalanced.raw',
'pam50.erBalanced.FPKM',
'pam50.erBalanced.FPKM.UQ',
'pam50Genefu.ruv'
)
new.info <- SummarizedExperiment::colData(brca.cancer.se)[ , cols]
new.info <- new.info[ complete.cases(new.info) , ]
colnames(new.info)[5:ncol(new.info)] <- c(
'Raw counts (genefu)',
'FPKM (genefu)',
'FPKM.UQ (genefu)',
'Raw counts (ER balanced)',
'FPKM (ER balanced)',
'FPKM.UQ (ER balanced)',
'RUV-III (genefu)'
)
data.names <- colnames(new.info)[5:ncol(new.info)]
set.seed(2023091423)
pam50.survival <- lapply(
c(1:length(data.names)),
function(x){
df.annot <- new.info
index <- df.annot$OS.time_liu < 4000 &
df.annot[, data.names[x]] != 'Normal like'
pam50.survival.ruv <- survival_plot(
data = brca.rawCounts.cancer[ , index],
stratify = 'covariate',
annot = df.annot[index, ],
scoreCol = NULL,
covariate = data.names[x],
isCategoricalCov = TRUE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = NULL,
mainTitle1 = data.names[x],
confInt = FALSE,
ylabel = "Survival",
cols = c(
RColorBrewer::brewer.pal(9, "Set1")[c(2, 3, 4, 5, 7, 8)],
RColorBrewer::brewer.pal(8, "Dark2")[c(8, 1, 4, 6)]
),
nColLegend = 2,
plotType = "autoplot"
)$plot +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 10),
axis.title.y = element_text(size = 10),
plot.title = element_text(size = 8),
axis.text.x = element_text(size = 8),
axis.text.y = element_text(size = 8),
legend.text = element_text(size = 6),
legend.title = element_text(size = 8),
strip.text.x = element_text(size = 10)
)
})
cowplot::plot_grid(
plotlist = pam50.survival,
nrow = 3)Figure 7.27: Kaplan-Meier survival curves show the association between the PAM50 subtypes and overall survival in the TCGA BRCA RNA-seq data.
8 How robust RUV-III is to poorly chosen PRPS
Here, we assess the performance of RUV-III with poorly chosen PRPS on the TCGA BRCA data. To simulate poorly chosen PRPS, we randomly shuffle (20%, 40%, 60% and 80%) of the PAM50 subtypes that were originally used to create PRPS for RUV-III. The shuffling steps were repeated 10 times for each proportion and the results were averaged for normalization performance assessments.
8.1 PRPS with shuffling of the PAM50
Here, we randomly shuffle 20%, 40%, 60% and 80% of the PAM50 subtypes and create PRPS.
samples.to.use <- brca.sampleAnnot.cancer$pam50.consensus == 1
brca.sampleAnnot.prps <- droplevels(brca.sampleAnnot.cancer[ samples.to.use, ])
brca.rawCounts.prps <- brca.rawCounts.cancer[ , samples.to.use]
set.seed(703221215)
prps.random <- lapply(
c(seq(.2, .8, .2)),
function(x){
prps <- lapply(
c(1:10),
function(y){
a <- sample(
x = 1:nrow(brca.sampleAnnot.prps),
size = round(x = x*nrow(brca.sampleAnnot.prps), digits = 0))
b <- sample(
x = 1:nrow(brca.sampleAnnot.prps),
size = round(x = x*nrow(brca.sampleAnnot.prps), digits = 0))
brca.sampleAnnot.prps$pam50.geneFu.fpkmUq.2 <-
brca.sampleAnnot.prps$pam50.geneFu.fpkmUq
brca.sampleAnnot.prps$pam50.geneFu.fpkmUq.2[a] <-
brca.sampleAnnot.prps$pam50.geneFu.fpkmUq[b]
prps.brca.random <- .CreatePseudoSamplesForLsPurityBatch(
expr.data = brca.rawCounts.prps,
sample.info = droplevels(brca.sampleAnnot.prps),
librarySize = 'libSize',
batch = 'PlateId_mda',
biology = 'pam50.geneFu.fpkmUq.2',
purity = 'purity_HTseq_FPKM',
include.ls = TRUE,
include.purity = TRUE,
minSamplesPerBatchPS = 3,
minSamplesForPurityPerBiology = 12,
minSamplesForPurityPS = 3,
minSamplesForLibrarySizePerBatch = 12,
minSamplesForLibrarySizePS = 3)
})
names(prps) <- paste0('rep', 1:10)
prps
})
names(prps.random) <- paste0(
"shuffle",
seq(.2, .8, .2)
)
prps.random.data <- lapply(
c(1:4),
function(y){
prps <- lapply(
c(1:10),
function(x){
prps.batch <- prps.random[[y]][[x]]$ps.batch
colnames(prps.batch) <-
lapply(
colnames(prps.batch),
function(x){
unlist(strsplit(x, '[_]'))[1]})
prps.purity <- prps.random[[y]][[x]]$ps.purity
prps.ls <- prps.random[[y]][[x]]$ps.ls
prps.all <- cbind(prps.batch, prps.purity, prps.ls)
prps.all
})
names(prps) <- paste0('rep', 1:10)
prps
})
names(prps.random.data) <- names(prps.random)8.2 RUV-III normalization
We apply RUV-III with different sets of poorly chosen PRPS.
ruv.data <- t(log2(brca.rawCounts.cancer + 1))
ncg <- colnames(ruv.data) %in% all.ncg.sets[[6]]
ruviii.prps.random <- lapply(
c(1:4),
function(x){
ruv <- lapply(
c(1:10),
function(y){
prps.data <- prps.random.data[[x]][[y]]
rep.matrix <- ruv::replicate.matrix(colnames(prps.data))
ruviii.norm.random <- .fastRUV_III_PRPS(
Y = ruv.data,
Yrep = t(log2(prps.data + 1)),
M = rep.matrix,
ctl = ncg,
k = 2,
eta = 1,
return.info = F)
ruviii.prps.random <- t(ruviii.norm.random[1:ncol(brca.rawCounts.cancer), ])
ruviii.prps.random
})
names(ruv) <- paste0('rep', 1:10)
ruv
})
names(ruviii.prps.random) <- names(prps.random)8.3 Normalizations performance assessments
Here, we use several performance assessment metrics to evaluet how robust the RUV-III is to poorly chosen PRPS.
8.4 PAM50 subtypes
8.4.1 Vector correlation
Figure 8.1 shows the vector correlation coefficient between the PAM50 subtypes and the first 10 principal components for different normalization methods. The results show that the RUV-III with up to 40% shuffling of the PAM50 subtypes shows better performance compared to the TCGA normalizations in separating the PAM50 subtypes.
pca.ruv.random <- lapply(
c(1:4),
function(x){
pcs <- lapply(
c(1:10),
function(y){
fast.pca(
data = ruviii.prps.random[[x]][[y]],
is.log = TRUE,
nPcs = 10)
})
names(pcs) <- paste0('rep', 1:10)
pcs
})
names(pca.ruv.random) <- names(prps.random)
###
pam50.dummies <- fastDummies::dummy_cols(brca.cancer.se$pam50Genefu.ruv)
pam50.dummies <- pam50.dummies[, c(2:ncol(pam50.dummies))]
cca.pam50.ruv.random <- lapply(
c(1:4),
function(x){
cca.all <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]
cms.caa <- sapply(
1:10,
function(z) {
cms.caa <-
stats::cancor(
x = pcs$sing.val$u[, 1:z, drop = FALSE],
y = pam50.dummies)
1 - prod(1 - cms.caa$cor ^ 2)
})
})
names(cca.all) <- paste0('rep', 1:10)
cca.all
})
cca.pam50.ruv.random <- lapply(
c(1:4),
function(x){
rowMeans(do.call(cbind, cca.pam50.ruv.random[[x]]))
})
names(cca.pam50.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2))
cca.pam50.ruv.random <- do.call(
c,
list(
cca.pam50Genefu.all,
cca.pam50.ruv.random))
pcs.cms.cca <- as.data.frame(cca.pam50.ruv.random) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III,
'RUV-III-0.2' = RUV_III_0.2,
'RUV-III-0.4' = RUV_III_0.4,
'RUV-III-0.6' = RUV_III_0.6,
'RUV-III-0.8' = RUV_III_0.8) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'vec.corr') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8'))) %>%
data.frame(.)
# Plot
ggplot(pcs.cms.cca, aes(x = pcs, y = vec.corr, group = datasets)) +
geom_line(aes(color = datasets), size = .5) +
geom_point(aes(color = datasets), size = 2) +
xlab('PCs') +
ylab (expression("Vector correlation")) +
scale_color_manual(
values=c(dataSets.colors.2),
labels = names(dataSets.colors.2),
name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
axis.text.x = element_text(size = 12, angle = 25, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))Figure 8.1: A plot showing the vector correlation coefficient between the PAM50 subtypes and up to the first 10 principal components.
8.4.2 Silhouette coefficient ann ARI index analyses
Silhouette coefficient ann ARI index analyses (Figure 8.2) show that the RUV-III with up to 40% shuffling of the PAM50 subtypes shows better performance compared to the TCGA normalizations in separating the PAM50 subtypes.
silCoef.pam50.ruv.random <- lapply(
c(1:4),
function(x){
sil.coef <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]
.silhouette.coeff(
pcs = pcs$sing.val$u,
variable = brca.cancer.se$pam50Genefu.ruv,
nPCs = 3)
})
names(sil.coef) <- paste0('rep', 1:10)
sil.coef})
silCoef.pam50.ruv.random <- lapply(
c(1:4),
function(x){
mean(unlist(silCoef.pam50.ruv.random[[x]]))
})
names(silCoef.pam50.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2)
)
silCoef.pam50.ruv.random <- do.call(
c,
c(silCoef.pam50Genefu.all,
silCoef.pam50.ruv.random)
)
pcs.cms.silCoef <- as.data.frame(silCoef.pam50.ruv.random) %>%
tidyr::pivot_longer(
everything(),
names_to = 'silCoef.cms',
values_to = 'silCoef') %>%
dplyr::mutate(datasets = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')) %>%
dplyr::mutate(datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')))
p1 <- ggplot(pcs.cms.silCoef, aes(x = datasets, y = silCoef)) +
geom_col() +
ylab('Silhouette coefficient') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))
# ARI
set.seed(2011110837)
ari.pam50.ruv.random <- lapply(
c(1:4),
function(x){
ari <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]$sing.val$u[, 1:3]
BIC <- mclust::mclustBIC(data = pcs)
mod <- mclust::Mclust(data = pcs, x = BIC, G = 4)
mclust::adjustedRandIndex(
mod$classification,
brca.cancer.se$pam50Genefu.ruv)
})
names(ari) <- paste0('rep', 1:10)
ari
})
ari.pam50.ruv.random <- lapply(
c(1:4),
function(x){
mean(unlist(ari.pam50.ruv.random[[x]]))
})
names(ari.pam50.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2)
)
ari.pam50.ruv.random <- do.call(
c,
c(ari.pam50Genefu.all,
ari.pam50.ruv.random)
)
pcs.cms.ari <- as.data.frame(ari.pam50.ruv.random) %>%
tidyr::pivot_longer(
everything(),
names_to = 'silCoef.cms',
values_to = 'ari') %>%
dplyr::mutate(datasets = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')) %>%
dplyr::mutate(datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')))
# Plot
p2 <- ggplot(pcs.cms.ari, aes(x = datasets, y = ari)) +
geom_col() +
ylab('ARI') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 12,
angle = 45,
vjust = 1,
hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))
gridExtra::grid.arrange(
p1, p2, ncol = 2)Figure 8.2: Silhouette coefficients and ARI index for separating the PAm50 subtypes.
8.5 Tumour purity variation
Here, we assess the performance of RUV-III with poorly chosen PRPS in removing tumour purity.
8.5.1 Association between PCs and tumour purity variation
Linear regression between the first 10 PCs and tumour purity variation (Figure 8.3) shows that the RUV-III with up to 40% shuffling of the PAM50 subtypes shows better performance compared to the TCGA normalizations in removing the variation of tumor purity in the data.
lreg.pcs.purity.ruv.random <- lapply(
c(1:4),
function(x){
rSquared <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]$sing.val$u
ls.rSquared <- sapply(
1:10,
function(z) {
lm.ls <- summary(lm(brca.sampleAnnot.cancer$purity_HTseq_FPKM.UQ ~ pcs[, 1:z]))$r.squared
})
})
names(rSquared) <- paste0('rep', 1:10)
rSquared
})
names(lreg.pcs.purity.ruv.random) <- names(prps.random.data)
lreg.pcs.ls.ruv.random <- lapply(
c(1:4),
function(x){
rowMeans(do.call(cbind, lreg.pcs.purity.ruv.random[[x]]))
})
names(lreg.pcs.ls.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2)
)
lreg.pcs.purity.ruv.random <- do.call(
c,
list(
lreg.purity.all,
lreg.pcs.ls.ruv.random)
)
pcs.ls.lnreg <- as.data.frame(lreg.pcs.purity.ruv.random) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III,
'RUV-III-0.2' = RUV_III_0.2,
'RUV-III-0.4' = RUV_III_0.4,
'RUV-III-0.6' = RUV_III_0.6,
'RUV-III-0.8' = RUV_III_0.8) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'r.sq') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')))
ggplot(pcs.ls.lnreg, aes(x = pcs, y = r.sq, group = datasets)) +
geom_line(aes(color = datasets), size = .5) +
geom_point(aes(color = datasets), size = 2) +
xlab('PCs') + ylab (expression("R"^"2")) +
scale_color_manual(
values = c(dataSets.colors.2),
name = 'Datasets',
labels = names(dataSets.colors.2)) +
scale_x_continuous(
breaks = (1:10),
labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(
breaks = scales::pretty_breaks(n = 5),
limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
axis.text.x = element_text(size = 12, angle = 35, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))Figure 8.3: A plot showing the R-squared of linear regression between tumour purity and up to the first 10 principal components (taken cumulatively) for different normalization methods.
8.6 Flow cell chemistry
8.6.1 Association between PCs and flow cell chemistry
Figure 8.4 shows that the RUV-III with poorly chosen PRPS outperforms the TCGA normalizations in removing flow cell chemistry effects from the data.
fcch.dummies <- fastDummies::dummy_cols(brca.cancer.se$FcCh)
fcch.dummies <- fcch.dummies[, c(2:ncol(fcch.dummies))]
cca.fcch.ruv.random <- lapply(
c(1:4),
function(x){
cca.all <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]
cms.caa <- sapply(
1:10,
function(z) {
cms.caa <-
stats::cancor(
x = pcs$sing.val$u[, 1:z, drop = FALSE],
y = fcch.dummies)
1 - prod(1 - cms.caa$cor ^ 2)
})
})
names(cca.all) <- paste0('rep', 1:10)
cca.all
})
cca.fcch.ruv.random <- lapply(
c(1:4),
function(x){
rowMeans(do.call(cbind, cca.fcch.ruv.random[[x]]))
})
names(cca.fcch.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2))
cca.fcch.ruv.random <- do.call(
c,
list(
cca.fcch.all,
cca.fcch.ruv.random))
pcs.cms.cca <- as.data.frame(cca.fcch.ruv.random) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III,
'RUV-III-0.2' = RUV_III_0.2,
'RUV-III-0.4' = RUV_III_0.4,
'RUV-III-0.6' = RUV_III_0.6,
'RUV-III-0.8' = RUV_III_0.8) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'vec.corr') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8'))) %>%
data.frame(.)
# Plot
ggplot(pcs.cms.cca, aes(x = pcs, y = vec.corr, group = datasets)) +
geom_line(aes(color = datasets), size = .5) +
geom_point(aes(color = datasets), size = 2) +
xlab('PCs') +
ylab (expression("Vector correlation")) +
scale_color_manual(
values=c(dataSets.colors.2),
labels = names(dataSets.colors.2),
name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
axis.text.x = element_text(size = 12, angle = 35, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))Figure 8.4: A plot showing the vector correlation coefficient between the flow cell chemistry and up to the first 10 principal components.
8.6.2 Silhouette coefficient ann ARI index analyses
Silhouette coefficient and ARI index analyses (Figure 8.5) show that the RUV-III with poorly chosen PRPS shows better performance compared to the TCGA normalizations in removing the effects of flow cell chemistries.
silCoef.fcch.ruv.random <- lapply(
c(1:4),
function(x){
sil.coef <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]
.silhouette.coeff(
pcs = pcs$sing.val$u,
variable = brca.sampleAnnot.cancer$FcCh,
nPCs = 3)})
names(sil.coef) <- paste0('rep', 1:10)
sil.coef})
silCoef.fcch.ruv.random <- lapply(
c(1:4),
function(x){
mean(unlist(silCoef.fcch.ruv.random[[x]]))
})
names(silCoef.fcch.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2))
silCoef.fcch.ruv.random <- do.call(
c,
c(silCoef.fcch.all,
silCoef.fcch.ruv.random))
pcs.cms.silCoef <- as.data.frame(silCoef.fcch.ruv.random) %>%
tidyr::pivot_longer(
everything(),
names_to = 'silCoef.cms',
values_to = 'silCoef') %>%
dplyr::mutate(datasets = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')) %>%
dplyr::mutate(datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')))
p1 <- ggplot(pcs.cms.silCoef, aes(x = datasets, y = silCoef)) +
geom_col() +
ylab('Silhouette coefficient') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 12))
# ARI
set.seed(2011110837)
ari.fcch.ruv.random <- lapply(
c(1:4),
function(x){
ari <- lapply(
c(1:10),
function(y){
pcs <- pca.ruv.random[[x]][[y]]$sing.val$u[, 1:3]
BIC <- mclust::mclustBIC(data = pcs)
mod <- mclust::Mclust(data = pcs, x = BIC, G = 2)
mclust::adjustedRandIndex(
mod$classification,
brca.sampleAnnot.cancer$FcCh)
})
names(ari) <- paste0('rep', 1:10)
ari
})
ari.fcch.ruv.random <- lapply(
c(1:4),
function(x){
mean(unlist(ari.fcch.ruv.random[[x]]))
})
names(ari.fcch.ruv.random) <- paste0(
'RUV_III_',
seq(.2,.8, .2)
)
ari.fcch.ruv.random <- do.call(
c,
c(ari.fcch.all,
ari.fcch.ruv.random)
)
pcs.cms.ari <- as.data.frame(ari.fcch.ruv.random) %>%
tidyr::pivot_longer(
everything(),
names_to = 'silCoef.cms',
values_to = 'ari') %>%
dplyr::mutate(datasets = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')) %>%
dplyr::mutate(datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')))
# Plot
p2 <- ggplot(pcs.cms.ari, aes(x = datasets, y = ari)) +
geom_col() +
ylab('ARI') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 12,
angle = 45,
vjust = 1,
hjust = 1),
axis.text.y = element_text(size = 12))
gridExtra::grid.arrange(p1, p2, ncol = 2)Figure 8.5: Silhouette coefficients and ARI index for mixing samples from two different flow cell chemistries.
8.7 Known co-expressed genes
It has been reported that the gene expression of CCNB1 is highly correlated in CENPE,AURKB, PLK1 and PLK4 genes in breast cancer Markus Ringnér et.al. Here, we explore these correlation in different normalized data of the TCGA BRCA RNA-seq data. Figure 8.6 shows the those correlations are preserved in the RUV-III with poorly chosen PRPS.
### correlation in the RUV-III with poorly chosen PRPS
genes <- c(
'CENPE',
'AURKB',
'PLK1',
'PLK4' )
gene.corr.random <- lapply(
genes,
function(x){
corr <- lapply(
c(1:4),
function(y){
lapply(
c(1:10),
function(z){
data <- ruviii.prps.random[[y]][[z]]
cor.test(data['CCNB1' , ], data[x , ], method = 'spearman')[[4]]
})
})
names(corr) <- names(prps.random)
corr
})
names(gene.corr.random) <- genes
mean.gene.corr.random <- lapply(
genes,
function(x){
lapply(
names(prps.random),
function(y){
mean(unlist(gene.corr.random[[x]][[y]]))
})
})
names(mean.gene.corr.random) <- genes
### correlation in the brca.cancer.se
gene.corr <- lapply(
genes,
function(x){
lapply(
normalizations,
function(y){
data <- SummarizedExperiment::assay(brca.cancer.se, y)
cor.test(data['CCNB1' , ], data [x , ], method = 'spearman')[[4]]
})
})
names(gene.corr) <- genes
gene.corr.all <- lapply(
names(gene.corr),
function(x){
corrs <- c(unlist(gene.corr[[x]]), unlist(mean.gene.corr.random[[x]]))
names(corrs) <-
c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8')
corrs
})
names(gene.corr.all) <- names(gene.corr)
genes <- data.frame(gene.corr.all) %>%
dplyr::mutate(datasets = row.names(.)) %>%
tidyr::pivot_longer(-datasets, names_to = 'genes', values_to = 'corr') %>%
dplyr::mutate(datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-0.2',
'RUV-III-0.4',
'RUV-III-0.6',
'RUV-III-0.8'))) %>%
data.frame()
ggplot(genes, aes(x = datasets, y = corr)) +
geom_point() +
facet_wrap(~genes) +
ylab('Spearman correlation') +
xlab('')+
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 10,
angle = 45,
vjust = 1,
hjust = 1),
axis.text.y = element_text(size = 10),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 15))Figure 8.6: Spearman correlations between CCNB1 and CENPE,AURKB, PLK1 and PLK4 genes in different normalized data of the TCGA BRCA RNA-seq data
8.8 Gene-gene correlation
Here, we explore the correlation between two pairs of genes including CNOT1_E2F4 and ESRRA_MAP3K2 that were discussed in the TCGA BRCA RNA-seq data.
pair.genes <- list(
pair1 = c('CNOT1', 'E2F4'),
pair2 = c('ESRRA', 'MAP3K2'))
g.g.corr_1 <- lapply(
names(pair.genes),
function(x){
corr.coef <- lapply(
normalizations,
function(y){
data <- SummarizedExperiment::assay(brca.cancer.se, y)
cor.test(
data[pair.genes[[x]][1] , ],
data[pair.genes[[x]][2] , ],
method = 'spearman')[[4]]
})
names(corr.coef) <- normalizations.names
corr.coef
})
names(g.g.corr_1) <- names(pair.genes)
g.g.corr_2 <- lapply(
names(pair.genes),
function(g){
corr <- lapply(
c(1:4),
function(x){
corr.all <- lapply(
c(1:10),
function(y){
data <- ruviii.prps.random[[x]][[y]]
cor.test(
data[pair.genes[[g]][1] ,],
data[pair.genes[[g]][2] ,],
method = 'spearman')[[4]]})
mean(unlist(corr.all))
})
names(corr) <- paste0(
'RUV-III.',
seq(.2, .8, .2))
corr
})
names(g.g.corr_2) <- names(pair.genes)
g.g.corr <- lapply(
names(pair.genes),
function(x){
c(unlist(g.g.corr_1[[x]]), unlist(g.g.corr_2[[x]]))
})
names(g.g.corr) <- c('CNOT1_E2F4', 'ESRRA_MAP3K2')
g.g.corr <- as.data.frame(g.g.corr) %>%
dplyr::mutate(datasets = factor(row.names(.), row.names(.))) %>%
dplyr::mutate(datasets = gsub('.rho', '', datasets)) %>%
tidyr::pivot_longer(-datasets, names_to = 'genes', values_to = 'corr') %>%
dplyr::mutate(datasets = factor(datasets, levels = c('Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III.0.2',
'RUV-III.0.4',
'RUV-III.0.6',
'RUV-III.0.8')))
ggplot(g.g.corr, aes(x = datasets, y = corr)) +
geom_point() +
xlab('') +
ylab('Spearman correlation') +
facet_wrap(~genes, scale = 'free') +
theme_bw() +
theme(
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 12,
angle = 45,
vjust = 1,
hjust = 1
),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))Figure 8.7: Plots show the Spearman correlation between two pair of genes in differently normalized data.
8.8.1 Gene expression and survival
selected.genes <- c(
'ESRRA',
'ZEB2')
pval <- list()
for(i in selected.genes){
sur.gene <- lapply(
normalizations,
function(x){
p <- survival_plot(
data = as.data.frame(SummarizedExperiment::assay(brca.cancer.se, x)),
stratify = 'expr',
annot = as.data.frame(SummarizedExperiment::colData(brca.cancer.se)),
scoreCol = NULL,
gene = i,
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
confInt = FALSE,
mainTitle1 = i,
ylabel = "Survival",
cols = c(
brewer.pal(9, "Set1")[c(2, 3, 4, 5, 7, 8)],
brewer.pal(8, "Dark2")[c(8, 1, 4, 6)]),
nColLegend = 1,
plotType = "autoplot")
p$pval
})
pval[[i]] <- sur.gene
}
##
ESRRA.surv <- lapply(
c(1:4),
function(x){
all.pval <- lapply(
c(1:10),
function(y){
p <- survival_plot(
data = ruviii.prps.random[[x]][[y]],
stratify = 'expr',
annot = brca.sampleAnnot.cancer,
scoreCol = NULL,
gene = 'ESRRA',
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
confInt = FALSE,
mainTitle1 = i,
ylabel = "Survival",
nColLegend = 1,
plotType = "autoplot"
)
p$pval
})
names(all.pval) <- paste0('rep', 1:10)
all.pval
})
mean.pval.ESRRA.surv <- unlist(lapply(
c(1:4),
function(x) mean(unlist(ESRRA.surv[[x]]))
))
names(mean.pval.ESRRA.surv) <- c(paste0('RUV-III.', seq(.2,.8, .2)))
ESRRA.surv <- unlist(pval$ESRRA)
names(ESRRA.surv) <- c('Raw counts', 'FPKM', 'FPKM.UQ', 'RUV-III')
ESRRA.surv <- as.data.frame(
do.call(
c,
list(ESRRA.surv, mean.pval.ESRRA.surv)))
colnames(ESRRA.surv) <- 'pval'
ESRRA.surv$dataset <- factor(row.names(ESRRA.surv), levels = row.names(ESRRA.surv))
ggplot(ESRRA.surv, aes(x = dataset, y = pval)) +
geom_point() +
xlab('') +
ylab('P-value') +
theme_bw() +
ggtitle('ESRRA') +
geom_hline(yintercept = 0.05, col = 'black', linetype = 'dotted') +
theme(
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 12,
angle = 45,
vjust = 1,
hjust = 1
),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))Figure 8.8: Association between gene expression and overall survival in the raw data and differently normalized datasets of the TCGA BRCA RNA-seq data
9 RUV-III with partially known biological labels for PRPS
We assess the performance of RUV-III with PRPS in situations where the biological labels are partially known (hereafter called the RUV-III-P). To simulate such situations, we only used the Basal and LumA of the PAM50 subtypes to create RRPS for RUV-III normalization of the TCGA BRCA RNA-seq data.
samples.to.use <- brca.sampleAnnot.cancer$pam50.consensus == 1 &
brca.sampleAnnot.cancer$pam50.geneFu.fpkmUq =='Basal' |
brca.sampleAnnot.cancer$pam50.geneFu.fpkmUq =='LumA'
prps.brca.partial <- .CreatePseudoSamplesForLsPurityBatch(
expr.data = brca.rawCounts.cancer[ , samples.to.use],
sample.info = droplevels(brca.sampleAnnot.cancer[samples.to.use , ]),
librarySize = 'libSize',
batch = 'PlateId_mda',
biology ='pam50.geneFu.fpkmUq',
purity = 'purity_HTseq_FPKM',
include.ls = TRUE,
include.purity = TRUE,
minSamplesPerBatchPS = 3,
minSamplesForPurityPerBiology = 12,
minSamplesForPurityPS = 3,
minSamplesForLibrarySizePerBatch = 12,
minSamplesForLibrarySizePS = 3)9.1 RUV-III-Normaliazation
We apply RUV-III normalization on the TCGA BRCA RNA-seq data with the PRPS that were generated using the Basal and LumA subtypes.
colnames(prps.brca.partial$ps.batch) <- unlist(lapply(
colnames(prps.brca.partial$ps.batch),
function(x){
unlist(strsplit(x, '[_]'))[1]
}))
brca.ruv.input <- t(log2(cbind(
brca.rawCounts.cancer,
prps.brca.partial$ps.ls,
prps.brca.partial$ps.batch,
prps.brca.partial$ps.purity
) + 1)) # 1252 16537
## replicate matrix
rep.matrix.ruv <- ruv::replicate.matrix(
row.names(brca.ruv.input)
) # 1252 1119
## ruv-iii normalization
ruviii.norm.partial <- RUV_III_PRPS(
Y = brca.ruv.input,
M = rep.matrix.ruv,
ctl = colnames(brca.ruv.input) %in% all.ncg.sets[[6]],
k = 12,
eta = NULL,
return.info = TRUE
)
ruviii.prps.par <- t(ruviii.norm.partial$newY[1:1086, ])9.2 Normalization performance assessments
We create a new SummarizedExperiment object with all the datasets.
brca.cancer.se <- SummarizedExperiment::SummarizedExperiment(
assays = list(
HTseq_counts = SummarizedExperiment::assay(
brca.cancer.se,
'HTseq_counts'),
HTseq_FPKM = SummarizedExperiment::assay(
brca.cancer.se,
'HTseq_FPKM'),
HTseq_FPKM.UQ = SummarizedExperiment::assay(
brca.cancer.se,
'HTseq_FPKM.UQ'),
RUV_III = ruviii.prps.norm,
RUV_III_p = ruviii.prps.par
),
colData = droplevels(S4Vectors::DataFrame(
SummarizedExperiment::colData(brca.se[ , index.cancer]))),
rowData = as.data.frame(
SummarizedExperiment::rowData(brca.se))
)
normalizations <- names(
SummarizedExperiment::assays(brca.cancer.se)
)
normalizations.names <- c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-P')9.2.1 PAM50 subtypes
9.2.1.1 Association between PCs and the PAM50 subtypes
Figure 9.1 shows that performance of the RUV-III-P normalization is similar to the original RUV-III in separating the PAM50 subtypes.
### PCA
pca.all <- lapply(
normalizations,
function(x){
fast.pca(
data = as.matrix(
SummarizedExperiment::assay(brca.cancer.se, x)
),
is.log = TRUE,
nPcs = 10)
})
names(pca.all) <- normalizations
genefu.calls <- c(
'pam50.geneFu.raw',
'pam50.geneFu.fpkm',
'pam50.geneFu.fpkmUq',
'pam50Genefu.ruv',
'pam50Genefu.ruv')
cca.pam50Genefu.all <- lapply(
c(1:5),
function(x){
pcs <- pca.all[[x]]$sing.val$u
pam50Genefu.dummies <- fastDummies::dummy_cols(
SummarizedExperiment::colData(brca.cancer.se)[, genefu.calls[x] ]
)
pam50Genefu.dummies <- pam50Genefu.dummies[, c(2:ncol(pam50Genefu.dummies))]
sapply(
1:10,
function(y){
cca.pam50 <- stats::cancor(
x = pcs[, 1:y, drop = FALSE],
y = pam50Genefu.dummies)
1 - prod(1 - cca.pam50$cor^2)
})
})
names(cca.pam50Genefu.all) <- normalizations
cca.pam50Genefu <- as.data.frame(cca.pam50Genefu.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III,
'RUV-III-P' = RUV_III_p
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'cca.corr') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-P')))
### plot
ggplot(cca.pam50Genefu, aes(x = pcs, y = cca.corr, group = datasets)) +
geom_line(aes(color = datasets)) +
geom_point(aes(color = datasets)) +
xlab('') +
ylab ("Vector correlation") +
scale_color_manual(values = c(dataSets.colors.3), name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
axis.text.y = element_text(size = 12))Figure 9.1: A plot showing the vector correlation coefficient between the PAm50 subtypes and up to the first 10 principal components
9.2.1.2 Silhouette coefficient ann ARI index analyses
Silhouette coefficient and ARI index analyses (Figure 9.2) show that the RUV-III-P results in a satisfactory normalization by preserving the PAM50 variation.
silCoef.pam50Genefu.all <- lapply(
c(1:5),
function(x){
.silhouette.coeff(
pcs = pca.all[[x]]$sing.val$u,
variable = SummarizedExperiment::colData(
brca.cancer.se)[ , genefu.calls[x]],
nPCs = 3)
})
names(silCoef.pam50Genefu.all) <- normalizations
silCoef.pam50 <- as.data.frame(silCoef.pam50Genefu.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'silhou.coff' ) %>%
dplyr::mutate(datasets = factor(
normalizations.names,
levels = normalizations.names))
p.sil.pam50 <- ggplot(silCoef.pam50, aes(x = datasets, y = silhou.coff)) +
geom_col() +
ylab('Silhouette coefficient') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))
### ARI
set.seed(2012190737)
ari.pam50Genefu.all <- lapply(
c(1:5),
function(x){
pcs <- pca.all[[x]]$sing.val$u[,1:3]
BIC <- mclust::mclustBIC(data = pcs)
mod <- mclust::Mclust(data = pcs, x = BIC, G = 5)
mclust::adjustedRandIndex(
mod$classification,
SummarizedExperiment::colData(brca.cancer.se)[, genefu.calls[x]]
)
})
names(ari.pam50Genefu.all) <- normalizations
ari.pam50Genefu <- as.data.frame(ari.pam50Genefu.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'ari' ) %>%
dplyr::mutate(datasets = factor(
normalizations.names,
levels = normalizations.names))
p.ari.pam50 <- ggplot(ari.pam50Genefu, aes(x = datasets, y = ari)) +
geom_col() +
ylab('ARI') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10)
)
gridExtra::grid.arrange(p.sil.pam50, p.ari.pam50, ncol = 2)Figure 9.2: Silhouette coefficients and ARI index for separating the PAM50 subtypes.
9.2.2 Flow cell chemistry
9.2.2.1 Association between PCs and flow cell chemistry
Figure 9.3 shows that the RUV-III-P removes the flow cell chemistry effects from the data.
fcch.dummies <- fastDummies::dummy_cols(brca.cancer.se$FcCh)
fcch.dummies <- fcch.dummies[, c(2:ncol(fcch.dummies))]
cca.fcch.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u
sapply(
1:10,
function(y){
cca.fcch <- stats::cancor(
x = pcs[, 1:y, drop = FALSE],
y = fcch.dummies)
1 - prod(1 - cca.fcch$cor^2)
})
})
names(cca.fcch.all) <- normalizations
pcs.fcch.cca <- as.data.frame(cca.fcch.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III,
'RUV-III-P' = RUV_III_p
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'cca.coef') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-P')))
## plot
ggplot(pcs.fcch.cca, aes(x = pcs, y = cca.coef, group = datasets)) +
geom_line(aes(color = datasets)) +
geom_point(aes(color = datasets)) +
xlab('PCs') + ylab ("Vector correlation") +
scale_color_manual(values = c(dataSets.colors.3), name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 12, angle = 45, hjust = 1),
axis.text.y = element_text(size = 12))Figure 9.3: A plot showing the vector correlation coefficient between flow cell chemistry subtypes and up to the first 10 principal components
9.2.2.2 Silhouette coefficient ann ARI index analyses
Silhouette coefficient and ARI index analyses (Figure 9.4) show that the RUV-III-P results in a satisfactory normalization by removing the flow cell chemistry from the data.
silCoef.fcch.all <- lapply(
normalizations,
function(x){
.silhouette.coeff(
pcs = pca.all[[x]]$sing.val$u,
variable = brca.cancer.se$FcCh,
nPCs = 3)
})
names(silCoef.fcch.all) <- normalizations
silCoef.fcch <- as.data.frame(silCoef.fcch.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'ari') %>%
dplyr::mutate(datasets = factor(
normalizations.names,
levels = normalizations.names))
p.sil.fcch <- ggplot(silCoef.fcch, aes(x = datasets, y = ari)) +
geom_col() +
ylab('Silhouette coefficient') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12))
### Ari
set.seed(2012190737)
ari.fcch.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u[,1:3]
BIC <- mclust::mclustBIC(data = pcs)
mod <- mclust::Mclust(data = pcs, x = BIC, G = 2)
mclust::adjustedRandIndex(
mod$classification,
brca.cancer.se$FcCh)
})
names(ari.fcch.all) <- normalizations
ari.fcch <- as.data.frame(ari.fcch.all) %>%
tidyr::pivot_longer(
everything(),
names_to = 'datasets',
values_to = 'ari') %>%
dplyr::mutate(datasets = factor(
normalizations.names,
levels = normalizations.names))
p.ari.fcch <- ggplot(ari.fcch, aes(x = datasets, y = ari)) +
geom_col() +
ylab('ARI') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18),
plot.title = element_text(size = 15),
axis.text.x = element_text(size = 12, angle = 25, hjust = 1),
axis.text.y = element_text(size = 12))
gridExtra::grid.arrange(
p.sil.fcch,
p.ari.fcch,
ncol = 2)Figure 9.4: Silhouette coefficients and ARI index for mixing samples from two flow cell chemistry batches.
9.2.3 Tumour purity variation
9.2.3.1 Association between PCs and tumour purity variation
Figure 9.5 shows the RUV-III-P reduces the variation of tumor purity in the data.
lreg.purity.all <- lapply(
normalizations,
function(x){
pcs <- pca.all[[x]]$sing.val$u
ls.rSquared <- sapply(
1:10,
function(y) {
lm.ls <- summary(lm(brca.cancer.se$purity_HTseq_FPKM.UQ ~ pcs[, 1:y]))$r.squared
})
})
names(lreg.purity.all) <- normalizations
lreg.purity <- as.data.frame(lreg.purity.all) %>%
dplyr::rename(
'Raw counts' = HTseq_counts,
FPKM = HTseq_FPKM,
FPKM.UQ = HTseq_FPKM.UQ,
'RUV-III' = RUV_III,
'RUV-III-P' = RUV_III_p
) %>%
dplyr::mutate(pcs = c(1:10)) %>%
tidyr::pivot_longer(
-pcs,
names_to = 'datasets',
values_to = 'r.sq') %>%
dplyr::mutate(
datasets = factor(
datasets,
levels = c(
'Raw counts',
'FPKM',
'FPKM.UQ',
'RUV-III',
'RUV-III-P'))
)
### plot
ggplot(lreg.purity, aes(x = pcs, y = r.sq, group = datasets)) +
geom_line(aes(color = datasets)) +
geom_point(aes(color = datasets)) +
xlab('') +
ylab (expression('R'^'2')) +
scale_color_manual(values = dataSets.colors.3, name = 'Datasets') +
scale_x_continuous(breaks = (1:10), labels = c('PC1', paste0('PC1:', 2:10)) ) +
scale_y_continuous(breaks = scales::pretty_breaks(n = 5), limits = c(0,1)) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(size = 14, angle = 45, vjust = 1, hjust = 1),
axis.text.y = element_text(size = 16),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))Figure 9.5: A plot showing the R-squared of linear regression between tumour purity and up to the first 10 principal components (taken cumulatively) for different normalization methods.
9.2.4 Association between genes expression and survival
Survival analysis shows that RUV-III-P reveals the association between gene expression of ZEB2, STAB1 and ESRRA and overall survival in the TCGA BRCA RNA-seq data (Figure 9.6).
selected.genes <- c(
'ZEB2',
'STAB1',
'ESRRA')
pp.tcga <- lapply(
selected.genes,
function(x){
p.survival.purity.tcga <- survival_plot(
data = SummarizedExperiment::assay(brca.cancer.se, 'HTseq_FPKM.UQ'),
stratify = 'expr',
annot = SummarizedExperiment::colData(brca.cancer.se),
scoreCol = NULL,
gene = x,
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
mainTitle1 = x,
confInt = FALSE,
ylabel = "Survival",
cols = c(brewer.pal(9, "Set1"))[c(2, 3)],
nColLegend = 1,
plotType = "autoplot"
)$plot +
surv.theme
})
pp.ruv <- lapply(
selected.genes,
function(x){
p.survival.purity.tcga <- survival_plot(
data = SummarizedExperiment::assay(brca.cancer.se, 'RUV_III'),
stratify = 'expr',
annot = SummarizedExperiment::colData(brca.cancer.se),
scoreCol = NULL,
gene = x,
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
mainTitle1 = x,
confInt = FALSE,
ylabel = "Survival",
cols = c(brewer.pal(9, "Set1"))[c(2, 3)],
nColLegend = 1,
plotType = "autoplot"
)$plot +
surv.theme
})
pp.ruv.p <- lapply(
selected.genes,
function(x){
p.survival.purity.tcga <- survival_plot(
data = SummarizedExperiment::assay(brca.cancer.se, 'RUV_III_p'),
stratify = 'expr',
annot = SummarizedExperiment::colData(brca.cancer.se),
scoreCol = NULL,
gene = x,
covariate = NULL,
isCategoricalCov = FALSE,
timeCol = "OS.time_liu",
eventCol = "OS_liu",
nGroup = 2,
mainTitle1 = x,
confInt = FALSE,
ylabel = "Survival",
cols = c(brewer.pal(9, "Set1"))[c(2, 3)],
nColLegend = 1,
plotType = "autoplot"
)$plot +
surv.theme
})
gridExtra::grid.arrange(
pp.tcga[[1]],
pp.ruv[[1]],
pp.ruv.p[[1]],
pp.tcga[[2]],
pp.ruv[[2]],
pp.ruv.p[[2]],
pp.tcga[[3]],
pp.ruv[[3]],
pp.ruv.p[[3]],
ncol = 3)Figure 9.6: Association between gene expression and overall survival in the differently normalized datasets of the TCGA BRCA RNA-seq data. Plots are for FPKM.UQ, RUV-III and RUV-III-P.
9.2.5 Gene-gene correlation
Here we explore the correlation between pair of genes.
9.2.5.1 Known co-expressed genes
It has been reported that the gene expression of CCNB1 is highly correlated in CENPE,AURKB, PLK1 and PLK4 genes in breast cancer Markus Ringnér et.al. Here, we explore these correlation in different normalized data of the TCGA BRCA RNA-seq data. Figure 9.7 shows the those correlations are preserved in the RUV-III with poorly chosen PRPS.
genes <- c(
'CENPE',
'AURKB',
'PLK1',
'PLK4')
corr.genes <- lapply(
normalizations,
function(x){
data <- SummarizedExperiment::assay(brca.cancer.se, x)
corrs <- sapply(
genes,
function(y){
cor.test(data['CCNB1' , ], data[y, ], method = 'spearman')[[4]]
})
names(corrs) <- genes
corrs
})
names(corr.genes) <- normalizations.names
corr.genes <- data.frame(corr.genes, check.names = F) %>%
dplyr::mutate(genes = row.names(.)) %>%
tidyr::pivot_longer(-genes, names_to = 'datasets', values_to = 'corr') %>%
dplyr::mutate(datasets = factor(datasets, levels = normalizations.names)) %>%
data.frame()
ggplot(corr.genes, aes(x = datasets, y = corr)) +
geom_point() +
facet_wrap( ~ genes) +
ylab('Spearman correlation') +
xlab('') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 14,
angle = 45,
vjust = 1,
hjust = 1
),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 14))Figure 9.7: Spearman correlations between CCNB1 and CENPE,AURKB, PLK1 and PLK4 genes in different normalized data of the TCGA BRCA RNA-seq data.
Here, we explore the correlation between two pairs of genes including CNOT1_E2F4 and ESRRA_MAP3K2 that were discussed in the TCGA BRCA RNA-seq data (Figure 9.8).
genes.list <- list(
pair1 = c('CNOT1', 'E2F4'),
pair2 = c('ESRRA', 'MAP3K2'))
gene.corr <- lapply(
names(genes.list),
function(x){
corrs <- sapply(
normalizations,
function(y){
data <- SummarizedExperiment::assay(brca.cancer.se, y)
cor.test(
data[genes.list[[x]][1], ] ,
data[genes.list[[x]][2], ] ,
method = 'spearman')[[4]]
})
names(corrs) <- normalizations.names
corrs
})
names(gene.corr) <- names(genes.list)
gene.corr <- data.frame(gene.corr)
gene.corr$datasets <- factor(row.names(gene.corr), levels = normalizations.names)
p1 <- ggplot(gene.corr, aes(x = datasets, y= pair1)) +
geom_point(size = 3) +
ylab('Spearman correlation') +
xlab('') +
ggtitle('Correlation between CNOT1 and E2F4') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 14,
angle = 45,
vjust = 1,
hjust = 1
),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))
p2 <- ggplot(gene.corr, aes(x = datasets, y= pair2)) +
geom_point(size = 3) +
ylab('Spearman correlation') +
xlab('') +
ggtitle('Correlation between ESRRA and MAP3K2') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 14,
angle = 45,
vjust = 1,
hjust = 1
),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 12),
legend.title = element_text(size = 14))
gridExtra::grid.arrange(p1, p2, ncol = 2)Figure 9.8: Spearman correlations for two pairs of genes including CNOT1_E2F4 and ESRRA_MAP3K2.
10 Other normalizations
There are other RNA-seq normalizations including SVAseq, ComBat-seq and RUVg methods that are not specifically designed for normalization, although they can be helpful for that task when the unwanted variation is orthogonal to the biology, something that is rarely known in advance. The same applies to the RUVs method provided in the RUVSeq package. Although if there are true replicates (missing from TCGA and most large cancer RNAseq studies), it can be used to normalize RNA-seq datasets.
Here, we apply ComBat-seq, RUVg and RUVs on the TCGA BRCA RNA-seq data and assess their performance.
10.1 ComBat-seq
We apply the ComBat-seq method on the TCGA BRCA RNA-seq data. We specify plates as known sources of batch effects.
combat_seq.plates <- sva::ComBat_seq(
counts = brca.rawCounts.cancer,
batch = brca.cancer.se$plate_RNAseq)10.1.1 Association between PCs and library size
Figure 10.1 shows that the Combat-seq method did not remove the library size variation in the TCGA BRCA RNA-seq data. This results may suggest that the Combat-seq method is not specifically designed for normalization.
pca.combat <- fast.pca(data = combat_seq.plates, nPcs = 3, is.log = FALSE)
pcs <- data.frame(
PC1 = pca.combat$sing.val$u[,1],
PC2 = pca.combat$sing.val$u[,2],
PC3 = pca.combat$sing.val$u[,3],
ls = brca.cancer.se$libSize,
purity = brca.cancer.se$purity_HTseq_FPKM.UQ)
p1 <- ggplot(pcs, aes(PC1, ls)) +
geom_point() +
ylab('Library size') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 16))
p2 <- ggplot(pcs, aes(PC3, purity)) +
geom_point() +
ylab('Tumour purity') +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = 1),
axis.title.x = element_text(size = 16),
axis.title.y = element_text(size = 16),
axis.text.x = element_text(size = 14),
axis.text.y = element_text(size = 16))
gridExtra::grid.arrange(p1, p2, ncol = 2)Figure 10.1: Scatter plots show that the PC1 and PC3 of the ComBat-seq normalized data are high correlated with library size and purity respectively.
10.2 RUVg
In the RUVSeq R Bioconductor package there are two methods including RUVg and RUVs that produce both the estimated factors of unwanted variation and normalized counts data. The RUVg method uses a set of negative control genes to estimates the factors of unwanted variation which can be used as covariates in differential gene expression analysis. The authors recommend: “the normalized counts can be used just for exploration, as removing the unwanted factors from the counts can also remove part of a factor of interest.”Ref However, we apply RUVg on the TCGA BRCA RNA-seq data.
ruv.g <- RUVSeq::RUVg(
x = brca.rawCounts.cancer,
cIdx = negative.control.genes,
k = 12)
ruv.g <- ruv.g$normalizedCounts10.2.1 Association between gene expression and survival
The survival analyses (Figure 10.2) show that the RUVg method removes the association between gene expression and survival in the TCGA BRCA RNA-seq data.
pca.ruvg <- fast.pca(data = ruv.g, nPcs = 3, is.log = FALSE)
p <- .scatter.density.pc(
pcs = pca.ruvg$sing.val$u[, 1:3],
pc.var = pca.ruvg$var,
group.name = 'PAM50',
group = SummarizedExperiment::colData(brca.cancer.se)[, 'pam50Genefu.ruv'],
color = pam50.colors[2:6],
strokeSize = .2,
pointSize = 2,
strokeColor = 'gray30',
alpha = .6,
title = 'RUVg')
do.call(
gridExtra::grid.arrange,
c(p, ncol = 2)
)Figure 10.2: PCA plots of the RUVg normalized data coloured by the PAM50 subtypes.
10.3 RUVs
Another method in the RUVSeq package is RUVs. This method uses replicate/negative control samples for which the covariates of interest are constant and negative control genes to estimate the factors of unwanted variation. RUVs produces both the factors of unwanted variation that can be used for differential expression analysis and normalized counts data. Generally, this method is used in situations where the genuine replicate/negative control samples are available. For example, an RNA-Seq study that involves control and treatment samples.
However, we apply RUVs on the TCGA BRCA RNA-Seq data and use the consensus molecular subtypes as replicate/negative control samples for RUVs. We also use a set of negative control genes that was used for the RUV-III-PRPS normalization.
rep.samples <- RUVSeq::makeGroups(
SummarizedExperiment::colData(brca.cancer.se)[, 'pam50Genefu.ruv']
)
ruv.s <- RUVSeq::RUVs(
x = brca.rawCounts.cancer,
cIdx = negative.control.genes,
k = 12,
scIdx = rep.samples)
ruv.s <- ruv.s$normalizedCounts10.3.1 Expression levels of several genes
Figure 10.3 shows that RUVs converts the expression levels of some genes with reasonable expression to zero across all samples. For examples, the log2 expression of the LARP7 gene is between 10 to 12.5 in the TCGA raw counts data, while the expression of this gene is almost zero in the RUVs normalized data. These results may suggest that RUVs is not designed for RNA-seq normalization in situations where true technical replicates are not available.
selected.genes <- c(
'IKZF5',
'LYRM2',
'OXCT1',
'ERC1')
pp <- lapply(
selected.genes,
function(g){
df <- data.frame(
Raw.count = brca.rawCounts.cancer[g ,],
RUVs = ruv.s[g ,],
sample = c(1:1086)) %>%
tidyr::pivot_longer(
-sample,
values_to = 'expr',
names_to = 'genes')
ggplot(df, aes(x = sample, y = log2(expr + 1))) +
geom_point() +
ylab('Gene expression') +
xlab('Samples') +
ggtitle(g) +
facet_wrap(~ genes) +
theme(
panel.background = element_blank(),
axis.line = element_line(colour = 'black', size = .85),
axis.title.x = element_text(size = 14),
axis.title.y = element_text(size = 14),
axis.text.x = element_text(
size = 12,
angle = 25,
vjust = 1,
hjust = 1),
axis.text.y = element_text(size = 12),
legend.text = element_text(size = 10),
legend.title = element_text(size = 14),
strip.text.x = element_text(size = 10))
})
do.call(
gridExtra::grid.arrange,
pp)Figure 10.3: Expression levels of several genes in the TCGA BRCA raw counts and the RUVs normalized data.
11 R Session information
options(max.print = 10^4)
sessionInfo()## R version 4.1.3 (2022-03-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
##
## attached base packages:
## [1] parallel grid stats4 stats graphics grDevices utils
## [8] datasets methods base
##
## other attached packages:
## [1] magrittr_2.0.2 future_1.24.0
## [3] ruv_0.9.7.1 BiocSingular_1.10.0
## [5] genefu_2.26.0 AIMS_1.26.0
## [7] e1071_1.7-9 iC10_1.5
## [9] iC10TrainingData_1.3.1 impute_1.68.0
## [11] pamr_1.56.1 cluster_2.1.2
## [13] biomaRt_2.50.3 survcomp_1.44.1
## [15] prodlim_2019.11.13 ppcor_1.1
## [17] MASS_7.3-55 wesanderson_0.3.6
## [19] ggjoy_0.4.1 ggridges_0.5.3
## [21] RColorBrewer_1.1-2 ggfortify_0.4.14
## [23] survival_3.2-13 mclust_5.4.9
## [25] ComplexHeatmap_2.10.0 SummarizedExperiment_1.24.0
## [27] Biobase_2.54.0 GenomicRanges_1.46.1
## [29] GenomeInfoDb_1.30.1 IRanges_2.28.0
## [31] S4Vectors_0.32.3 BiocGenerics_0.40.0
## [33] MatrixGenerics_1.6.0 matrixStats_0.61.0
## [35] dplyr_1.0.8 ggplot2_3.3.5
## [37] cowplot_1.1.1 BiocStyle_2.22.0
## [39] DT_0.21 rmdformats_1.0.3
## [41] knitr_1.37
##
## loaded via a namespace (and not attached):
## [1] rappdirs_0.3.3 rtracklayer_1.54.0 R.methodsS3_1.8.1
## [4] tidyr_1.2.0 bit64_4.0.5 irlba_2.3.5
## [7] aroma.light_3.24.0 DelayedArray_0.20.0 R.utils_2.11.0
## [10] data.table_1.14.2 hwriter_1.3.2 KEGGREST_1.34.0
## [13] RCurl_1.98-1.6 GEOquery_2.62.2 doParallel_1.0.17
## [16] generics_0.1.2 GenomicFeatures_1.46.4 ScaledMatrix_1.2.0
## [19] RSQLite_2.2.10 proxy_0.4-26 bit_4.0.4
## [22] tzdb_0.2.0 xml2_1.3.3 lubridate_1.8.0
## [25] assertthat_0.2.1 viridis_0.6.2 xfun_0.29
## [28] hms_1.1.1 jquerylib_0.1.4 evaluate_0.15
## [31] fansi_1.0.2 restfulr_0.0.13 progress_1.2.2
## [34] dbplyr_2.1.1 readxl_1.3.1 DBI_1.1.2
## [37] htmlwidgets_1.5.4 EDASeq_2.28.0 purrr_0.3.4
## [40] ellipsis_0.3.2 ggpubr_0.4.0 backports_1.4.1
## [43] bookdown_0.24 annotate_1.72.0 vctrs_0.3.8
## [46] remotes_2.4.2 abind_1.4-5 cachem_1.0.6
## [49] withr_2.4.3 GenomicAlignments_1.30.0 prettyunits_1.1.1
## [52] crayon_1.5.0 genefilter_1.76.0 edgeR_3.36.0
## [55] pkgconfig_2.0.3 SuppDists_1.1-9.7 labeling_0.4.2
## [58] nlme_3.1-155 rlang_1.0.1 globals_0.14.0
## [61] lifecycle_1.0.1 filelock_1.0.2 survivalROC_1.0.3
## [64] fastDummies_1.6.3 BiocFileCache_2.2.1 rsvd_1.0.5
## [67] cellranger_1.1.0 graph_1.72.0 Ipaper_0.1.7
## [70] Matrix_1.4-0 carData_3.0-5 singscore_1.14.0
## [73] GlobalOptions_0.1.2 png_0.1-7 viridisLite_0.4.0
## [76] rjson_0.2.21 bootstrap_2019.6 bitops_1.0-7
## [79] R.oo_1.24.0 KernSmooth_2.23-20 Biostrings_2.62.0
## [82] blob_1.2.2 shape_1.4.6 stringr_1.4.0
## [85] ShortRead_1.52.0 parallelly_1.30.0 readr_2.1.2
## [88] jpeg_0.1-9 rstatix_0.7.0 ggsignif_0.6.3
## [91] rmeta_3.0 beachmat_2.10.0 scales_1.1.1
## [94] memoise_2.0.1 GSEABase_1.56.0 plyr_1.8.6
## [97] hexbin_1.28.2 zlibbioc_1.40.0 compiler_4.1.3
## [100] BiocIO_1.4.0 clue_0.3-60 Rsamtools_2.10.0
## [103] cli_3.2.0 XVector_0.34.0 listenv_0.8.0
## [106] mgcv_1.8-39 tidyselect_1.1.2 stringi_1.7.6
## [109] highr_0.9 yaml_2.3.5 locfit_1.5-9.4
## [112] latticeExtra_0.6-29 sass_0.4.0 tools_4.1.3
## [115] future.apply_1.8.1 circlize_0.4.14 rstudioapi_0.13
## [118] foreach_1.5.2 gridExtra_2.3 farver_2.1.0
## [121] digest_0.6.29 BiocManager_1.30.16 lava_1.6.10
## [124] Rcpp_1.0.8 car_3.0-12 broom_0.7.12
## [127] httr_1.4.2 AnnotationDbi_1.56.2 colorspace_2.0-3
## [130] XML_3.99-0.9 splines_4.1.3 xtable_1.8-4
## [133] jsonlite_1.8.0 R6_2.5.1 pillar_1.7.0
## [136] htmltools_0.5.2 glue_1.6.2 fastmap_1.1.0
## [139] BiocParallel_1.28.3 class_7.3-20 RUVSeq_1.28.0
## [142] codetools_0.2-18 utf8_1.2.2 lattice_0.20-45
## [145] bslib_0.3.1 tibble_3.1.6 sva_3.42.0
## [148] curl_4.3.2 clipr_0.8.0 zip_2.2.0
## [151] openxlsx_4.2.5 limma_3.50.1 rmarkdown_2.11
## [154] munsell_0.5.0 GetoptLong_1.0.5 GenomeInfoDbData_1.2.7
## [157] iterators_1.0.14 reshape2_1.4.4 gtable_0.3.0